1 Executive Summary


This is the final report of a project entitled “Multiple Knowledge Sources for Speech
Recognition”, sponsored by the Defense Advanced Research Projects Agency (DARPA)
and  monitored  by  SPAWAR  under  Contract  N00039-85-C-0423.  The  project  spanned  the 
period  30  April  1985  to  31  July  1989. 

The objective of this project has been to develop methods and techniques to coordinate
the many sources of knowledge in the decision process for a speech recognition
system.  This  effort  includes  finding  methods  for  effectively  combining  information  from 
various  knowledge  sources,  and  for  developing  recognition  search  strategies  that  find  the 
most  likely  word  sequence,  given  the  input  speech.  These  search  strategies  must  consider 
a  very  large  number  of  word-sequence  hypotheses  in  a  computationally  efficient  manner. 
To  develop  and  demonstrate  these  techniques,  we  designed  and  implemented  a  complete 
word recognition system for continuous speech which is capable of incorporating knowledge
from several sources, including lexical, phonetic, phonological, and grammatical
knowledge. The complete system was given the name BYBLOS, the name of an ancient
Phoenician  town  where  the  first  phonetic  writing  was  discovered.  The  BYBLOS  system 
has  been  used  as  our  testbed  system  for  evaluating  various  speech  recognition  algorithms 
and  search  strategies. 

In  addition  to  developing  algorithms  for  combining  multiple  knowledge  sources 
and  efficient  search  strategies,  this  project  also  dealt  with  several  other  issues,  including: 
specification  of  the  Resource  Management  Database  and  documentation  of  how  to  test 
with  it,  periodic  testing  to  meet  the  agreed  upon  test  requirements,  development  of  several 
standard  language  models  for  evaluation,  development  of  a  technique  for  estimation  of 
statistical  language  models  from  limited  text  corpora,  and  testing  of  ideas  for  speaker 
adaptation.  These  topics  are  discussed  in  more  detail  in  the  body  of  the  report  and  in  a 
number  of  papers  that  have  been  attached  to  this  report  as  an  Appendix. 
2 Introduction


The  most  fundamental  problem  in  speech  recognition  is  to  develop  an  accurate  model  of 
the  acoustic  signal  that  corresponds  to  any  sequence  of  words  or  phonemes,  in  order  that 
speech recognition can be performed. However, there are several other sources of knowledge
that can be used to improve speech recognition accuracy. Some of these include:
a  phonetic  lexicon  specifying  the  most  likely  pronunciations  for  each  word,  extended  by 
a  set  of  phonological  rules  suggesting  alternate  pronunciations,  a  model  of  likely  word 
sequences  -  based  either  on  a  heuristic  model  derived  from  rules,  a  statistically-based 
model  derived  by  estimating  probabilities  from  a  training  set,  or  a  linguistically-based 
model  that  uses  syntactic  and  semantic  information  explicitly. 

The  objective  of  this  project  was  to  develop  methods  and  techniques  to  coordinate 
the  many  sources  of  knowledge  in  the  decision  process  for  a  speech  recognition  system. 
This  effort  included  developing  methods  for  effectively  combining  information  from  the 
various  knowledge  sources,  and  methods  for  recognition  search  strategies  that  efficiently 
consider  the  tremendous  number  of  hypotheses  in  the  search  space.  To  develop  and 
demonstrate  these  techniques,  we  designed  and  implemented  a  word  recognition  system 
for  continuous  speech  input  that  employed  several  knowledge  sources.  The  system, 
which  we  called  BYBLOS,  was  then  used  as  a  testbed  system  for  evaluating  various 
recognition  algorithms  and  search  strategies.  Much  of  the  testing  of  the  system  used  the 
DARPA  Resource  Management  Task,  which  was  taken  from  the  Navy  battle  management 
(FCCBMP)  domain. 

The system that was developed was based on the continuous speech phonetic recognition
algorithm that had been developed in our program of basic research in continuous
speech  recognition  for  DARPA.  In  that  work,  the  model  for  each  word  is  derived  from 
a  set  of  pronunciations  from  a  dictionary,  a  set  of  phonological  rules,  and  from  data 
taken  from  natural  continuous  speech.  Each  phonetic  unit  within  a  word  is  represented 
by  a  combination  of  a  context-independent  model  and  several  context-dependent  models 
of  that  phoneme.  The  training  algorithm  that  was  developed  does  not  require  that  any 
speech  be  labeled  manually.  The  training  data  only  needs  to  be  transcribed  with  a  list  of 
the  words  spoken,  thus  greatly  reducing  the  amount  of  labor  required  and  increasing  the 
amount  of  data  that  can  be  made  available  for  training. 

In  addition  to  developing  algorithms  for  combining  multiple  knowledge  sources  and 
efficient search strategies, this project also dealt with several other issues. These included:
implementing a basic testbed system for evaluating different word recognition algorithms,
specification of the Resource Management Database and documentation of how to test
with it, periodic testing to meet the agreed upon test requirements, and testing of ideas
for  speaker  adaptation. 

The  chapters  of  this  report  are  organized  by  topic,  as  follows: 
The BYBLOS System

Lexical  and  Phonological  Knowledge 

Language  Models 

Search  Strategies 

System  Implementation 

Database  Specification  and  Documentation 

Testing  System  Performance  and  Demonstrations 

Speaker  Adaptation. 

In  each  of  the  chapters,  we  recount  the  major  areas  of  research  under  the  topic  of 
that  chapter.  Where  applicable,  we  also  review  the  major  technical  principles  involved. 
Further  details  can  be  found  in  the  set  of  papers  included  in  the  Appendix  at  the  end  of 
the  report. 
3 The BYBLOS System


Figure  1  is  a  block  diagram  of  the  BBN  BYBLOS  continuous  speech  recognition  system. 
We  show  the  different  modules  and  knowledge  sources  (KS)  that  comprise  the  complete 
system,  the  arrows  indicating  the  flow  of  module/KS  interactions.  The  modules  are 
represented by rectangular boxes. They are, starting from the top: Trainer, Word Model
Generator, and Decoder. Also shown are the knowledge sources, which are represented
by the ellipses. They include: Acoustic-Phonetic, Lexical, and Grammatical knowledge
sources.  We  describe  briefly  the  various  modules  and  how  they  interact  with  the  various 
KSs.  Additional  information  is  given  in  the  body  of  the  report  and  in  the  Appendix. 

Acoustic-Phonetic  Knowledge  Source 

The Trainer module is used for the acquisition of the acoustic-phonetic knowledge
source. It takes as input a phonetic dictionary, speech to be used for training and the
corresponding text transcription, and produces a database of context-dependent hidden
Markov  models  of  phonemes. 

Lexical  Knowledge  Source 

The  Word  Model  Generator  module  takes  as  input  the  phonetic  models  database  and 
compiles word phonetic models, using the dictionary as another input. The dictionary is
the lexical knowledge source, in which phonological rules of English are used to represent
each lexical item in terms of its most likely phonetic spellings. The lexical KS imposes
phonotactic constraints by allowing only legal sequences of phonemes to be hypothesized
in  the  recognizer,  reducing  the  search  space  and  improving  performance.  The  output  of 
the  Word  Model  Generator  is  a  database  of  word  models  used  in  the  recognizer. 

Grammatical  Knowledge  Source 

In  much  of  our  speech  recognition  work,  we  use  a  statistical  language  model  to 
represent  grammatical  constraints.  Such  models  allow  all  word  possibilities  but  with 
different probabilities such that the perplexity of the grammar is substantially lower than
the size of the vocabulary. The recognition search process then uses the word phonetic
models and the statistical grammar to find the most likely sequence of words, given the
input  speech. 
Figure 1: A block diagram of the BBN BYBLOS continuous speech recognition system.
4 Lexical and Phonological Knowledge


One  of  the  basic  tasks  in  a  recognition  system  is  to  develop  a  phonetic  dictionary  (the 
lexicon)  and  to  allow  for  the  incorporation  of  phonological  knowledge.  Therefore,  we 
developed  an  interactive  tool  that  allowed  a  user  to  create  and  edit  phonetic  spellings  in 
a dictionary. The tool could display the phonetic network corresponding to the pronunciations
of a word. In addition, the system allowed phonological rules to be entered, which
would  then  create  new  expanded  pronunciations  for  a  vocabulary.  The  system  kept  track 
of  the  history  of  each  word,  so  that  it  was  possible  to  determine  which  rules  created 
which parts of any pronunciation. We used these tools to create phonetic dictionaries for
the different tasks - primarily the 1000-word FCCBMP task.

One consideration that must be made in a system is the number of alternate pronunciations
that should be used for each word. In principle, one would like to represent
all  of  the  likely  pronunciations,  since  they  would  result  in  different  acoustic  realizations 
of  the  words.  However,  as  we  add  pronunciations  to  a  word,  the  difference  between  this 
word  and  other  words  decreases.  In  particular,  if  we  add  extra  pronunciations  to  account 
for inadequacies of the phonetic recognition system, then we must constantly change this
set  of  rules  as  the  system  improves.  Furthermore,  if  we  allow  alternate  pronunciations, 
we  must  be  careful  to  represent  the  fact  that  some  pronunciations  are  much  more  likely 
than  others. 

We  performed  experiments  to  determine  the  effect  of  having  different  numbers 
of  pronunciations.  We  found,  to  our  surprise,  that  the  best  recognition  performance 
was  achieved  when  we  limited  the  number  of  pronunciations  to  one  for  each  word. 
That is, even when we had a limited set of phonological rules, resulting in about two
pronunciations  per  word,  the  performance  was  worse.  On  reflection,  it  made  sense  that 
the  BYBLOS  system  would  perform  better  with  a  single  pronunciation  for  each  word. 
Most  of  the  phonological  variations  take  place  based  on  the  context  of  the  preceding 
and  following  phonemes.  However,  the  system  already  modeled  the  detailed  acoustic 
variation  in  phonemes  with  the  use  of  context-dependent  phonetic  HMM  models.  Thus 
this  probabilistic  model  for  fine  acoustic  differences  was  superior  to  a  gross  phonological 
model  of  the  shift  from  one  phoneme  to  another.  Some  systems  within  the  DARPA 
community  had  been  using  a  large  number  of  pronunciations  -  perhaps  10  or  more  per 
word.  In  particular,  the  CMU  Sphinx  system  performance  improved  dramatically  when 
all  the  alternate  pronunciations  were  removed  at  our  suggestion. 
5 Language Models

Our first experiments with the effect of grammatical constraints on recognition performance
involved changing the vocabulary size. We did this in order not to be affected by
the  particular  words  that  could  follow  each  other  in  a  grammar.  Rather  we  performed  a 
set  of  experiments  that  we  called  “Branching  Factor  Experiments”,  in  which  we  varied 
the  vocabulary  size  in  a  systematic  way.  Given  a  desired  vocabulary  size,  the  recognition 
program  would  read  in  each  sentence  to  be  recognized.  It  then  limited  the  vocabulary 
to  the  words  actually  in  the  sentence,  plus  enough  other  words  to  make  up  the  desired 
vocabulary  size.  Many  different  random  subsets  of  the  vocabulary  were  chosen  in  order 
to  remove  any  bias.  This  made  it  possible  to  plot  the  expected  word  recognition  error 
rate as a function of the branching factor. We found that the error rate was typically proportional
to the square root of the branching factor. We found, when we ran recognition
experiments using a deterministic grammar, that this relation still held. That is, the word
recognition error rate was generally proportional to the square root of the grammar perplexity.
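
The branching-factor procedure can be illustrated with a short Python sketch. This is not the original experimental code: the function names, the toy vocabulary, and the numbers in the note that follows are assumptions chosen only for illustration. The sketch builds a per-sentence vocabulary of a requested size and extrapolates an error rate under the square-root relation reported above.

```python
import math
import random


def sample_vocabulary(sentence_words, full_vocabulary, size, rng):
    """Return the sentence's own words plus random other words, `size` words in all."""
    required = set(sentence_words)
    if size < len(required):
        raise ValueError("vocabulary size is smaller than the sentence's own word set")
    # Sort the candidates so that a seeded generator gives reproducible subsets.
    others = sorted(set(full_vocabulary) - required)
    return required | set(rng.sample(others, size - len(required)))


def extrapolate_error(known_error, known_branching, new_branching):
    """Scale an error rate, assuming error is proportional to sqrt(branching factor)."""
    return known_error * math.sqrt(new_branching / known_branching)


# Hypothetical usage: a 50-word toy vocabulary and a 10-word test condition.
toy_vocabulary = ["w%d" % i for i in range(50)]
subset = sample_vocabulary(["w1", "w2"], toy_vocabulary, 10, random.Random(0))
```

Under this assumed relation, a hypothetical error rate of 5% at a branching factor of 25 would correspond to about 10% at a branching factor of 100. Averaging over many random subsets, as described above, would mean calling `sample_vocabulary` repeatedly with different seeds.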

After the basic branching factor experiments, we implemented a recognition program
that would allow the recognition to be constrained by finite-state grammars. This
also  required  building  some  tools  that  made  it  possible  to  create  and  manipulate  grammars. 
We  found  that  it  was  simplest  to  specify  grammars  in  terms  of  context-free  production 
rules.  These  rules  were  then  expanded  into  a  finite-state  network.  This  was  possible 
because  we  did  not  include  any  rules  that  caused  recursion.  The  recognition  program 
allowed  deterministic  finite-state  automata  (DFA),  in  which  all  the  words  leaving  any 
state  of  the  grammar  were  unique,  and  nondeterministic  automata  (NFA),  in  which  there 
could be duplicate words or “null arcs” that allowed a transition from one state to another
without  going  through  a  word.  In  general  the  same  language  can  be  represented  with  a 
much  smaller  NFA  than  DFA. 
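
The expansion of non-recursive production rules into a finite-state network can be sketched as follows. This is a minimal illustration under stated assumptions, not the grammar tools built for the project: the rule format (a dictionary from nonterminal to a list of right-hand sides), the function names, and the toy grammar are all hypothetical. The result is an NFA whose arcs carry words, with `None` standing for a null arc.

```python
from itertools import count


def expand(symbol, rules, start, end, arcs, ids, stack=()):
    """Add arcs between `start` and `end` that accept every expansion of `symbol`."""
    if symbol not in rules:               # terminal: a single word arc
        arcs.append((start, symbol, end))
        return
    if symbol in stack:                   # a recursive rule has no finite expansion here
        raise ValueError("recursive rule for %r" % symbol)
    for rhs in rules[symbol]:
        if not rhs:                       # empty right-hand side becomes a null arc
            arcs.append((start, None, end))
            continue
        prev = start
        for i, sym in enumerate(rhs):
            nxt = end if i == len(rhs) - 1 else next(ids)
            expand(sym, rules, prev, nxt, arcs, ids, stack + (symbol,))
            prev = nxt


def accepts(arcs, start, final, words):
    """Simulate the NFA on a word sequence, following null arcs freely."""
    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            s = todo.pop()
            for a, w, b in arcs:
                if a == s and w is None and b not in seen:
                    seen.add(b)
                    todo.append(b)
        return seen

    current = closure({start})
    for word in words:
        current = closure({b for a, w, b in arcs if a in current and w == word})
    return final in current


# Hypothetical toy grammar; state 0 is the start state and state 1 the final state.
rules = {
    "S": [["show", "NP"], ["list", "NP"]],
    "NP": [["the", "SHIP"]],
    "SHIP": [["ships"], ["carriers"]],
}
arcs = []
expand("S", rules, 0, 1, arcs, count(2))
```

With this toy grammar, `accepts(arcs, 0, 1, ["show", "the", "ships"])` is true, while a sequence outside the patterns, such as `["show", "ships"]`, is rejected. The sketch duplicates the sub-network of a nonterminal at each place it is used, which is one reason expanded networks can become large.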

The  first  grammar  that  we  constructed  for  the  1000-word  FCCBMP  corpus  was 
based  on  the  sentence  patterns  used  to  make  up  the  sentences  in  the  corpus.  This  grammar 
had  a  perplexity  of  only  10.  We  found  that  the  sentence  patterns  that  were  used  to  generate 
the 2850 sentences in the Resource Management corpus were not very robust. That is, if a
person  generally  familiar  with  the  domain  made  up  sentences  using  the  same  vocabulary, 
there  was  a  high  likelihood  that  the  sentence  could  not  have  been  generated  by  these 
patterns.  One  important  issue  in  speech  and  language  recognition  is  how  to  make  a 
grammar  that  will  cover  a  large  percentage  of  new  sentences. 

We developed a semiautomated tool for inferring grammatical structure from a
relatively small set of patterns. The tool found similarities in sentences, and then allowed
that part of both sentences to be represented by a context free rule. In this way, the system
generalized the sequences of words and word classes into a hierarchical set of rules
that would cover a much larger percentage of new sentences than the original sentence
patterns. We applied this tool to the training subset of the sentence patterns to create a
generalized context free grammar. We compared the perplexity and coverage of new data
of  this  new  grammar  to  the  original  pattern  grammar.  When  the  test  patterns  were  parsed 
using  only  the  training  patterns,  none  of  them  parsed.  However,  when  the  generalized 
context  free  grammar  was  used,  65%  of  the  test  patterns  parsed.  The  perplexity  of  this 
grammar  also  necessarily  increased,  from  10  to  75. 

To  use  this  new  grammar  we  built  a  version  of  the  decoder  that  could  use  context 
free  grammars  rather  than  just  finite  state  grammars.  This  involved  first  converting  the 
set  of  context  free  rules  into  a  recursive  transition  network,  in  which  the  several  right 
hand  sides  for  a  rule  are  merged  into  a  single  network  of  terminals  and  nonterminals. 
The  decoder  was  changed  so  that  instead  of  dealing  with  a  single  finite  state  network 
of  terminals  (words),  it  could  also  deal  with  networks  of  nonterminals.  When  the  sym¬ 
bol  in  a  network  was  a  terminal,  it  simply  created  a  new  instance  of  that  word  to  be 
matched.  When  it  was  a  nonterminal,  it  “pushed”  down  to  the  network  corresponding 
to  the  right  hand  sides  of  that  nonterminal.  Since  the  grammar  was  now  much  larger 
and  more  interconnected,  the  decoder  had  to  be  optimized  in  several  ways.  One  of  the 
major  optimizations  was  to  use  the  forward-backward  search  strategy  described  below. 
This  meant  that  when  the  decoder  came  upon  any  new  word  or  nonterminal,  it  would 
know  (from  the  forward  pass)  whether  this  word,  or  any  of  the  words  implied  by  this 
nonterminal could possibly be in the input speech starting at this frame.

Any  deterministic  grammar  derived  from  a  set  of  rules  will  have  a  problem  in 
that  new  test  data  may  not  be  able  to  be  parsed  by  the  grammar.  For  example,  when 
new  sentences  were  recorded  at  NOSC  (the  TONE  database),  we  found  that  most  of  the 
sentences were not covered by the sentence pattern grammar or the word pair grammar.
Even the generalized context free grammar covered only about 75% of the sentences that
used the same vocabulary. A statistical grammar that models the probability of the next
word  (or  word  class)  given  the  preceding  words  can  avoid  this  problem  by  assuming 
that  all  words  are  possible,  even  though  some  are  much  more  likely  than  others.  The 
statistical  grammar  also  has  the  additional  advantage  that  it  can  accurately  represent  the 
fact  that  some  words  or  classes  are  more  likely  than  others.  This  additional  information 
greatly  reduces  perplexity  and  increases  performance.  Therefore  we  began  an  effort  to 
estimate  and  use  a  robust  statistical  grammar. 

In  the  past  work  on  statistical  language  models,  the  training  set  for  estimating  the 
probabilities  of  word  sequences  needed  to  be  quite  large.  For  example,  IBM  currently 
uses  a  text  database  of  about  250  million  words  to  estimate  trigram  probabilities  in  their 
office  correspondence  task.  In  many  spoken  language  applications  there  is  no  possibility 
of  collecting  such  large  amounts  of  speech  because  the  application  does  not  yet  exist. 
Rather,  it  may  only  be  possible  to  collect  on  the  order  of  1000  sentences  from  a  simulation 
of  the  system.  To  alleviate  this  problem  we  have  extended  the  statistical  grammars 
typically  used  by  the  use  of  linguistic  knowledge.  In  particular,  we  group  the  different 
words  in  the  vocabulary  into  classes,  under  the  assumption  that  their  statistics  will  be 
relatively similar. For example, in the FCCBMP domain, one would initially assume
that  the  names  of  all  ships  would  be  equally  likely  in  any  particular  sentence.  Therefore, 
when  we  see  a  training  sentence  that  contains  one  ship  name,  we  assume  that  we  have 
seen  a  similar  sentence  with  every  other  ship  name.  In  addition,  there  are  sequences  of 
words that behave as a unit. For example, there are many ways of expressing a date.
We  can  assume  that  in  the  model  for  a  sentence,  we  need  not  distinguish  among  these 
different  forms  of  the  date.  Therefore,  the  whole  date,  which  may  consist  of  several 
words, is treated as a single nonterminal. This greatly reduces the amount of training
script  that  is  needed.  It  also  increases  the  number  of  words  over  which  the  grammar  has 
effect.  For  example,  we  can  now  predict  the  probability  of  a  particular  word  given  that 
it  was  preceded  by  a  preposition,  followed  by  a  date. 
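
The effect of pooling words into classes can be sketched with a small class-based bigram model. This is an illustration under stated assumptions rather than the estimator used in the project: the function names, the dictionary-based class map, the absence of smoothing, and the made-up ship names are all hypothetical. Each word's probability is taken as the probability of its class given the preceding class, divided equally among the class members.

```python
from collections import Counter


def train_class_bigram(sentences, word_class):
    """Count class-to-class transitions; a word without a class stands for itself."""
    context_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        classes = ["<s>"] + [word_class.get(w, w) for w in words]
        for prev, cur in zip(classes, classes[1:]):
            context_counts[prev] += 1
            bigram_counts[prev, cur] += 1
    return context_counts, bigram_counts


def word_probability(prev, word, context_counts, bigram_counts, word_class, class_size):
    """P(word | prev) = P(class(word) | class(prev)) / number of words in class(word)."""
    prev_class = word_class.get(prev, prev)
    cur_class = word_class.get(word, word)
    if context_counts[prev_class] == 0:
        return 0.0  # unseen context; a real model would smooth or back off here
    class_prob = bigram_counts[prev_class, cur_class] / context_counts[prev_class]
    return class_prob / class_size.get(cur_class, 1)


# Hypothetical data: one training sentence and a two-member SHIP class.
word_class = {"kirk": "SHIP", "fox": "SHIP"}
class_size = {"SHIP": 2}
contexts, bigrams = train_class_bigram([["show", "kirk"]], word_class)
```

In this toy case the single training sentence contains only one ship name, yet `word_probability("show", "fox", ...)` and `word_probability("show", "kirk", ...)` both evaluate to 0.5, which mirrors the assumption that seeing one ship name counts as having seen every other.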

We  developed  a  statistical  tool  that  allowed  us  to  estimate  a  variable  order  Markov 
chain for the sequence of word classes and nonterminals. The variable order chain has the
advantage  that  in  some  contexts,  there  is  enough  training  to  estimate  high  order  statistics, 
while  in  others,  only  first  order  statistics  can  be  reliably  estimated.  The  perplexity  of 
this model as measured on the training data was about 20, which is very low. When
measured on independent test data, the perplexity rose to about 60. However, unlike the
Word-Pair grammar, which also has perplexity 60, this model would be able to recognize
sentences  that  do  not  come  from  the  grammar  training  set.  We  compared  the  recognition 
performance  of  this  grammar  with  that  of  the  Word-Pair  grammar.  When  independent 
test  data  was  used  the  error  rates  were  10%  and  22%,  respectively,  for  the  two  grammars. 
The larger error rate for the Word-Pair grammar stems in part from the fact that several
word  pairs  in  the  test  set  were  not  allowed  by  the  grammar. 
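
The idea of a variable order chain, and the perplexity measure used to evaluate it, can be sketched as follows. This is a simplified illustration and not the statistical tool described above: the count threshold `min_count`, the rule of simply using the longest sufficiently observed context without smoothing, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_counts(sequences, max_order):
    """Count each symbol after every preceding context of length 0..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for n in range(min(max_order, i) + 1):
                counts[tuple(seq[i - n:i])][symbol] += 1
    return counts


def predict(counts, history, symbol, max_order, min_count):
    """Estimate P(symbol | history) from the longest context seen >= min_count times."""
    for n in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - n:])
        seen = counts.get(context, Counter())
        total = sum(seen.values())
        if total >= min_count:
            return seen[symbol] / total
    return 0.0  # no context, not even the empty one, had enough training data


def perplexity(probabilities):
    """Perplexity is 2 raised to the average negative log2 probability per symbol."""
    return 2 ** (-sum(math.log2(p) for p in probabilities) / len(probabilities))
```

With this rule, a context that occurs often in training is modeled at high order, while a rare context falls back to lower order statistics. A model that assigns probability 0.25 to every symbol of a test sequence has a perplexity of 4, which is the sense in which perplexity acts as an effective branching factor.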

This work on statistical language modeling from small corpora will be important in
future  work  on  spoken  language  recognition  because  of  the  need  to  improve  recognition 
performance  through  the  use  of  statistics,  and  because  the  training  sets  for  new  tasks  will 
always  be  small. 
6 Search Strategies


In  this  chapter  we  describe  our  work  on  developing  efficient  search  strategies  for  finding 
the  most  likely  word  sequence  given  the  acoustic  signal.  In  principle,  the  best  algorithm 
for  determining  the  word  sequence  given  an  acoustic  signal  is  to  consider  all  possible 
word  strings  exhaustively.  For  each  word  string  we  must  compute  a  score  (conditional 
probability)  of  that  word  sequence,  taking  into  account  all  the  sources  of  knowledge 
available.  Then  we  simply  choose  the  word  sequence  with  the  highest  score  as  our  answer. 
This  algorithm,  which  we  would  call  a  tightly  coupled,  top-down  search,  guarantees  the 
minimum  error  rate  for  a  given  set  of  knowledge  sources.  However,  this  exhaustive 
search  is  clearly  infeasible.  Therefore,  we  must  develop  search  strategies  that  approximate 
this  algorithm,  with  computation  that  is  acceptable.  However,  throughout  our  work  in 
developing  efficient  search  strategies,  we  always  must  keep  in  mind  that  we  are  trying  to 
approximate  the  effect  of  this  exhaustive  search.  The  remainder  of  this  chapter  enumerates 
several of the different search strategies that we have developed under this project.

Some of the general principles that were established for a desirable search strategy were:

1.  Use  the  computationally  “inexpensive”  knowledge  sources  to  reduce  the  number 
of  choices  drastically,  and  then  use  the  more  expensive  knowledge  sources  on  this 
reduced  set. 

2.  Any  decisions  made  in  (1)  must  be  made  in  a  way  that  almost  never  makes  a 
mistake,  otherwise  these  “irrecoverable  errors”  will  multiply  and  dominate  the 
errors. 

Two  search  strategies  that  are  commonly  used  in  speech  recognition  are  the  Best-First 
stack  search,  and  the  Viterbi  beam  search.  The  best-first  search  considers  only  the  best 
theory  at  a  particular  time.  It  extends  this  best  theory  by  all  possible  next  words,  scores 
all  these  new  theories  using  all  available  sources  of  knowledge,  and  reinserts  the  new 
scored  theories  back  into  a  stack  that  is  sorted  by  theory  score.  This  algorithm  has  the 
theoretical  advantage  that,  if  the  scores  are  meaningful,  it  should  do  the  least  amount  of 
computation.  However,  in  practice,  it  is  very  difficult  to  sort  the  theories  appropriately. 
In  particular,  it  is  very  hard  to  compare  two  theories  that  span  different  regions  of  the 
input  speech.  Therefore,  even  with  the  many  heuristics  that  are  used,  this  strategy  often 
finds  a  suboptimal  answer,  or  results  in  tremendous  amounts  of  computation. 

The  Viterbi  beam  search  is  much  easier  to  implement  than  the  best-first  search  and 
has many desirable properties. First, it is guaranteed to find the most likely sequence of states of a
finite-state  hidden  Markov  model  with  computation  that  is  proportional  to  the  number 
of  states  and  the  length  of  the  input  speech.  The  beam  search  implies  that  at  each  time 
frame,  all  theories  that  have  a  probability  that  is  sufficiently  far  below  the  probability  of 
the most likely theory are removed, since they are not likely to result in a better score
by  the  end  of  the  utterance. 
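
A Viterbi beam search over a small discrete hidden Markov model can be sketched as follows. This is a generic textbook-style illustration, not the BYBLOS decoder: the dictionary-based model format, the function names, and the beam expressed as a log-probability margin are assumptions. At each frame, theories scoring more than `beam` below the best theory are dropped before the surviving theories are extended.

```python
import math


def _log(p):
    return math.log(p) if p > 0 else -math.inf


def _prune(active, beam):
    """Keep only theories whose log score is within `beam` of the best one."""
    if not active:
        return active
    best = max(score for score, _ in active.values())
    return {s: v for s, v in active.items() if v[0] >= best - beam}


def viterbi_beam(obs, init, trans, emit, beam=math.inf):
    """Return (log score, state path) of the best surviving theory, or None.

    init[s], trans[s][t] and emit[s][o] are probabilities; missing entries are zero.
    """
    active = {}
    for s, p in init.items():
        score = _log(p) + _log(emit[s].get(obs[0], 0.0))
        if score > -math.inf:
            active[s] = (score, [s])
    for o in obs[1:]:
        active = _prune(active, beam)
        extended = {}
        for s, (score, path) in active.items():
            for t, p in trans.get(s, {}).items():
                cand = score + _log(p) + _log(emit[t].get(o, 0.0))
                # Viterbi step: keep only the best-scoring path into each state.
                if cand > -math.inf and (t not in extended or cand > extended[t][0]):
                    extended[t] = (cand, path + [t])
        active = extended
    if not active:
        return None
    return max(active.values(), key=lambda v: v[0])
```

With an infinite beam the function performs an exact Viterbi search, so the guarantee of finding the most likely state sequence holds only in that case. Narrowing the beam reduces the number of active states per frame at the risk of discarding the eventual best path.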

Unfortunately,  there  are  still  two  problems  with  the  Viterbi  search  algorithm.  First, 
the algorithm finds the most likely sequence of states, rather than the most likely sequence
of words. While this difference is often small, it does introduce some unnecessary errors.
Second, and more serious, most interesting models of language have a very large
(if not infinite) number of states. For example, even a context-free language cannot be
represented  using  a  finite  number  of  states.  Therefore  the  computation  associated  with 
the  straightforward  beam  search  is  often  excessive. 

In  1985  we  devised  an  algorithm  that  has  the  simplicity  of  the  Viterbi  algorithm 
but computes a score that more closely approximates the “true” score of the most likely
sequence  of  words.  Stated  simply,  the  algorithm  is  just  like  the  Viterbi  algorithm,  except 
that at each state, where the Viterbi algorithm keeps the maximum score from all preceding
states, our algorithm adds the scores from all the preceding states. We call this
algorithm  a  pseudo-Baum-Welch  search  for  the  most  likely  word  sequence.  In  several 
experiments  we  verified  that  this  algorithm  results  in  somewhat  lower  error  rates  than 
the  Viterbi  algorithm.  The  interesting  observation  was  that  the  difference  in  error  rate 
was relatively constant over different applications. That is, our pseudo-Baum-Welch algorithm
consistently resulted in 2% fewer errors than the Viterbi algorithm, whether the
original  error  rate  was  30%  or  5%.  Therefore,  when  the  error  rate  was  low  (which  it 
must  be  for  a  useful  system),  the  difference  was  important. 
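
The difference between keeping the maximum and adding the scores at each state can be seen in a toy trellis. The sketch below only illustrates that single change inside one word model. It does not reproduce the pseudo-Baum-Welch search over a full word network, and the function name, model format, and numbers are assumptions.

```python
def trellis_score(emit, trans, combine):
    """Score a frame sequence against one model, starting in state 0 and ending in the last state.

    emit[t][s] is the likelihood of frame t in state s; trans[r][s] is the
    transition probability from state r to state s. `combine` is `max` for a
    Viterbi-style score or `sum` for a score that adds over all state paths.
    """
    n = len(trans)
    alpha = [0.0] * n
    alpha[0] = emit[0][0]
    for frame in emit[1:]:
        alpha = [
            combine([alpha[r] * trans[r][s] for r in range(n)]) * frame[s]
            for s in range(n)
        ]
    return alpha[-1]


# Hypothetical word X: two state paths reach the final state over three frames.
x_trans = [[0.5, 0.5], [0.0, 1.0]]
x_emit = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

# Hypothetical word Y: only one state path is possible.
y_trans = [[0.0, 1.0], [0.0, 1.0]]
y_emit = [[0.6, 1.0], [1.0, 1.0], [1.0, 1.0]]
```

Here the best single path through X scores 0.5, below the 0.6 of Y, so a maximum-based search prefers Y. Adding the two paths through X gives 0.75, so the summed score prefers X. This is the kind of case in which the most likely state sequence and the most likely word differ.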

There  were  certain  remaining  problems  with  the  algorithm  described  above.  Since 
the  score  produced  by  this  algorithm  is  not  exactly  the  same  as  the  true  score,  there  is 
still  some  chance  that  it  will  not  find  the  word  sequence  that  has  the  highest  true  score.  In 
addition,  since  we  use  a  pruning  algorithm  to  try  to  avoid  computing  most  of  the  scores, 
it  is  possible  that  the  algorithm  eliminates  the  correct  word  sequence  from  consideration 
without  computing  its  full  score.  Because  of  these  two  problems  it  is  helpful  to  know  how 
the  true  score  of  the  correct  answer  compares  with  the  true  score  of  any  incorrect  answer 
that  the  recognition  program  finds.  Therefore  we  added  a  feature  to  the  decoder  that 
allows  the  system  to  find  the  true  score  for  any  particular  word  sequence  for  a  sentence, 
by scoring only that word sequence. We call this a forced scoring algorithm. Whenever
the  decoder  finds  the  wrong  word  sequence,  this  forced  scoring  algorithm  can  then  be 
used  to  find  the  true  score  for  the  correct  word  sequence  and  the  true  score  for  the  word 
sequence  that  the  decoder  found.  If  the  true  score  for  the  correct  word  sequence  is  higher 
than the score for the word sequence found, then we know that the error was due either to the
pseudo-Baum-Welch score being different from the true score, or to pruning out the
correct  answer.  When  we  tested  our  search  algorithm  using  this  new  feature,  we  found 
that  whenever  the  decoder  finds  an  incorrect  answer,  the  incorrect  answer  always  has  a 
higher  true  score  than  the  correct  answer.  This  confirms  that  the  decoding  algorithm  is 
empirically  optimal. 
As described above, it is also possible that search errors are the result of pruning
out  the  correct  answer  because,  at  some  point  in  the  utterance,  it  was  scoring  much  worse 
than  some  other  hypothesis.  Unfortunately,  unless  the  pruning  is  very  conservative,  there 
is  some  small  probability  that  the  most  likely  sequence  of  words  is  pruned  out  from 
consideration  due  to  a  small  region  where  it  scores  poorly.  While  this  might  happen 
only  once  in  25  sentences,  it  represents  an  unwanted  noise  in  our  estimation  of  error 
rates.  To  alleviate  this  problem  for  research  runs,  we  use  the  heuristic  described  above 
to  detect  errors  due  to  pruning  and  rerun  the  sentence  with  more  conservative  pruning  if 
necessary. The utterance is first run at a very aggressive pruning level, which results in a
factor  of  10  speed  up.  If  the  answer  found  is  incorrect,  the  sentence  is  run  again  forcing 
the  correct  answer,  and  then  the  answer  that  was  found  during  the  search.  Both  of  these 
require  very  little  time,  since  only  one  sequence  of  words  is  possible.  If  the  score  for  the 
correct  answer  is  actually  higher  than  that  for  the  incorrect  answer,  then  the  utterance  is 
rerun  with  a  very  conservative  pruning  threshold.  We  find  that,  including  the  utterances 
that  need  to  be  rerun,  the  net  effect  is  a  factor  of  about  5  in  the  speed  of  research  runs 
of  the  decoder.  Of  course,  when  the  decoder  is  running  in  any  real  application  or  formal 
evaluation,  it  doesn't  know  the  correct  answer,  and  so  it  must  use  a  more  conservative 
pruning  or  accept  some  increased  error  rate. 
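
The rerun procedure for research runs can be summarized as a short control-flow sketch. The `decode` and `forced_score` callables, the beam arguments, and the returned labels are hypothetical placeholders for whatever decoder interface is available. Only the decision logic follows the description above.

```python
def decode_with_rerun(utterance, reference, decode, forced_score, fast_beam, safe_beam):
    """Decode with aggressive pruning and rerun conservatively only on a search error.

    decode(utterance, beam) returns a word sequence. forced_score(utterance, words)
    returns the score of that one word sequence, where higher is better.
    """
    hypothesis = decode(utterance, fast_beam)
    if hypothesis == reference:
        return hypothesis, "fast_correct"
    # Forced scoring is cheap because only one word sequence is allowed.
    if forced_score(utterance, reference) > forced_score(utterance, hypothesis):
        # The correct answer scores better but was not found, so the search
        # lost it: rerun with the conservative pruning threshold.
        return decode(utterance, safe_beam), "rerun"
    # The wrong answer scores at least as well, so this is a modeling error
    # that more conservative pruning would not repair.
    return hypothesis, "model_error"
```

Because this procedure needs the reference transcription, it applies only to research runs, as noted above.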

The  more  serious  problem  mentioned  at  the  beginning  of  this  chapter  was  that  for 
large  language  models,  the  number  of  word  states  that  need  to  be  scored  in  each  frame  can 
often  be  significantly  larger  than  the  number  of  words  in  the  vocabulary.  For  example,  the 
Sentence  Pattern  Grammar,  which  is  a  large  finite  state  grammar,  has  about  100,000  arcs 
initially and about 30,000 arcs in its most compressed form. Consequently, unless the pruning
eliminates roughly 97% or more of these word arcs from consideration, the number
of active words in the beam search would be larger than the 1000-word vocabulary.
We have therefore developed a new class of recognition search strategies, which we
call  multiple-pass  search  strategies,  that  is  useful  for  speeding  up  the  search  with  large 
grammars,  such  as  statistical  grammars  and  natural  language  grammars.  These  algorithms 
find  upperbound  scores  for  each  of  the  words  in  the  vocabulary  in  different  regions  of  the 
input.  Then,  while  performing  a  grammar-directed  acoustic  search,  the  decoder  considers 
only  those  words  that  are  known  to  be  likely  given  the  input  speech.  The  particular 
version  of  this  paradigm  that  we  implemented  has  been  named  the  forward-backward 
search  because  of  its  similarity  to  the  forward-backward  training  algorithm. 

As  the  syntax-directed  search  is  proceeding  left-to-right  through  an  utterance,  it 
must  extend  each  theory  in  which  a  word  has  ended  by  all  the  possible  following  words. 
The  beam  search  reduces  the  number  of  theories  by  eliminating  those  for  which  the 
sentence  so  far  scores  badly,  compared  to  the  other  theories.  The  beam  search  would  be 
much  more  effective  if,  at  this  point  in  the  utterance,  it  could  know  the  score  that  the 
remainder  of  the  utterance  would  receive  also.  Then  the  pruning  could  be  based  on  this 
total  score.  Of  course,  computing  this  score  is  equivalent  to  performing  a  full  decode, 
which  is  what  we  are  trying  to  avoid.  However,  say  we  knew  the  score  for  the  most  likely 
sequence of words from this point in the utterance to the end of the utterance (ignoring
grammatical  constraints).  This  score  would  be  an  upper  bound  on  the  actual  score  that 
would  be  found  were  it  computed  exactly.  Furthermore,  say  we  knew  this  upperbound 
score for each possible first word in the remaining string. Now, we could consider each
of  the  words  that  can  come  next,  according  to  the  grammar,  and  for  each  one,  look  up 
the  upperbound  score  of  a  sequence  of  words  beginning  with  that  word  at  this  point  in 
the  utterance.  This  score,  when  added  to  the  score  of  the  theory  to  the  left,  can  then 
be  used  in  the  beam  pruning,  thus  eliminating  most  of  the  possible  continuations  of  this 
theory.  The  only  problem  with  this  algorithm  is  that  we  haven’t  seen  the  rest  of  the 
utterance  yet  (assuming  that  the  algorithm  is  running  in  close  to  real  time),  so  we  cannot 
possibly  compute  the  scores  of  word  sequences  in  the  future.  However,  if  we  turn  the 
problem  around,  there  is  a  solution  that  is  feasible.  Let  us  say  that  as  the  speech  is  given 
to  the  decoder  (in  real  time),  it  computes  the  scores  as  if  it  were  using  no  grammar  in  the 
recognition.  At  each  frame,  it  remembers  the  score  of  word  sequences  ending  with  each 
possible word in the vocabulary. Only a small fraction of the words in the vocabulary
will end with a good score at each frame. When the end of the utterance is detected,
the decoder then begins a grammar-directed search, but in the reverse direction. This
time,  since  most  words  have  been  eliminated,  the  decoding  proceeds  much  faster  than 
real  time.  The  most  likely  answer  is  then  found  with  only  a  short  delay  past  the  end  of 
the  utterance.  With  a  small  modification,  this  algorithm  can  also  be  made  to  run  forward, 
in  order  to  eliminate  even  the  small  delay  at  the  end  of  the  utterance. 
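
A toy sketch of this two-pass idea is given below. It is not the BYBLOS decoder: words are scored as whole segments by a caller-supplied function `seg_score`, word durations come from a small fixed set, the grammar is a plain set of allowed word pairs, and any word may begin or end a sentence. All of these names and simplifications are assumptions made for illustration.

```python
import math

NEG_INF = -math.inf

def forward_pass(vocab, durations, num_frames, seg_score):
    """No-grammar pass: alpha[t][w] is the best log score of any word
    string that ends with word w exactly at frame boundary t."""
    best = [NEG_INF] * (num_frames + 1)   # best[t] = max over w of alpha[t][w]
    best[0] = 0.0
    alpha = [dict() for _ in range(num_frames + 1)]
    for t in range(1, num_frames + 1):
        for w in vocab:
            for d in durations:
                if t - d >= 0 and best[t - d] > NEG_INF:
                    score = best[t - d] + seg_score(w, t - d, t)
                    if score > alpha[t].get(w, NEG_INF):
                        alpha[t][w] = score
        if alpha[t]:
            best[t] = max(alpha[t].values())
    return alpha, best

def backward_search(vocab, durations, num_frames, seg_score,
                    allowed_pairs, alpha, best, beam):
    """Grammar-directed right-to-left pass pruned with the forward scores.
    beta[t][v] = (score, next_boundary, next_word) for the best grammatical
    word string that covers frames t..num_frames and starts with word v."""
    beta = [dict() for _ in range(num_frames + 1)]
    for t in range(num_frames, 0, -1):
        for w in vocab:
            if w not in alpha[t]:
                continue
            if t == num_frames:
                succ, nxt = 0.0, None          # w is the last word
            else:
                cands = [(beta[t][v][0], v) for v in beta[t]
                         if (w, v) in allowed_pairs]
                if not cands:
                    continue                   # grammar allows no continuation
                succ, nxt = max(cands)
            # alpha[t][w] is an upper bound on any grammatical prefix ending in w at t.
            if alpha[t][w] + succ < best[num_frames] - beam:
                continue                       # pruned using the first-pass score
            for d in durations:
                if t - d < 0:
                    continue
                score = seg_score(w, t - d, t) + succ
                if score > beta[t - d].get(w, (NEG_INF,))[0]:
                    beta[t - d][w] = (score, t, nxt)
    if not beta[0]:
        return NEG_INF, []
    word = max(beta[0], key=lambda v: beta[0][v][0])
    total, words, t = beta[0][word][0], [], 0
    while word is not None:                    # follow the back-pointers left to right
        words.append(word)
        _, t, word = beta[t][word]
    return total, words
```

With an infinite `beam` the second pass is an exhaustive grammar-constrained search. With a finite beam, a word is extended at a boundary only if its no-grammar forward score plus the score of the already-decoded remainder is within `beam` of the best no-grammar score for the whole utterance.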

We  have  used  the  forward-backward  search  algorithm  described  above  to  speed  up 
the search for several very large grammars. These include the Sentence-Pattern grammar,
a high-order statistical grammar, and a recursive transition network grammar. Our
experiments  indicate  that,  for  these  large  grammars,  the  increase  in  speed  is  at  least  a 
factor  of  ten,  when  we  use  the  forward-backward  search  algorithm.  A  modification  of 
this  algorithm  would  use  a  first-order  statistical  grammar  in  the  first  pass,  in  order  to 
reduce  the  choices  further. 

7 System Implementation


Much  of  the  work  of  this  project  was  necessarily  spent  implementing  the  different  training 
and  recognition  algorithms.  The  initial  set  of  algorithms  for  word-based  training  and 
recognition  were  implemented  on  the  Symbolics  Lisp  machine  using  Zeta-Lisp.  The 
flexible  environment  made  it  relatively  easy  to  implement  several  algorithms  quickly. 
However,  the  resulting  programs  were  not  very  fast,  since  Lisp  tends  to  result  in  slow 
computation.  The  result  was  that  many  experiments  were  impractical  to  run,  since  the 
time  that  they  required  was  too  great.  In  particular,  it  was  frequently  more  than  the  mean 
time  between  failure  of  the  machines. 

More recently, we have completely redesigned and reimplemented all of the algorithms
in C on the SUN4 workstations. The implementation takes somewhat longer, but
the resulting programs run about an order of magnitude faster. As a result, it is possible
to run several different versions of the programs and to tune different parameters. One
of the direct consequences has been a marked improvement in the recognition accuracy
of the system. In addition, since the language used is more portable, we will be able to
take advantage of newer, faster machines as they become available, without having to
redesign  all  the  algorithms.  We  are  currently  investigating  several  faster  machines  that 
would  increase  our  computing  power  by  at  least  a  factor  of  five. 

8 Database Specification and Documentation


One  of  the  major  contributions  of  the  current  DARPA  program  in  speech  recognition 
has  been  the  specification,  recording,  and  adoption  of  a  standard  corpus  of  sentences 
for  research  and  standard  testing.  BBN  has  been  instrumental  in  the  specification  of  the 
corpus  and  the  testing  standards  to  be  used  with  that  corpus. 

Several  speech  corpora  were  defined  to  serve  the  different  research  needs  of  the 
community.  We  felt  that  it  would  be  important  to  have  a  wide  range  in  the  amount  of 
training  speech  available  for  each  speaker,  as  well  as  a  very  large  number  of  speakers 
available.  Therefore  the  basic  makeup  of  the  corpus  was  designed  so  that  there  would  be 
different  sections.  The  first  would  contain  640  speakers,  each  saying  10  sentences.  The 
second  would  contain  160  speakers,  each  saying  40  sentences.  The  third  would  be  geared 
for speaker-dependent research, and would contain 12 speakers with 600 training sentences
for each. Finally, there would be 2 to 4 speakers with 2 to 4 hours of speech each.
The  640  speaker  corpus  consisted  of  material  designed  for  detailed  phonetic  research, 
and  was  thus  phonetically  marked.  This  corpus  has  been  called  the  TIMIT  database. 
The  other  corpora  consisted  of  sentences  pertaining  to  the  Resource  Management  task 
domain.  Each  of  the  corpora  contained  designated  training  sets,  development  test  sets  to 
be  used  while  trying  out  new  algorithms,  and  evaluation  test  sets  for  formal  testing. 

We  specified  the  sentences  in  the  FCCBMP  battle  management  domain  (later  to  be 
called  the  Resource  Management  task  domain)  through  lengthy  discussions  with  people  at 
NOSC.  This  task  involved  becoming  familiar  enough  with  the  application  and  likely  uses 
to  generate  a  1000-word  vocabulary  and  about  1000  different  sentences  with  database 
queries,  display  commands,  and  expert  system  questions.  As  such,  the  task  combined 
the domains that reside in several different systems, most notably, IDB, OSGP, and
FRESH.  After  the  initial  sentences  had  been  composed,  and  checked  by  NOSC,  they 
were  converted  into  sentence  patterns  by  replacing  the  open  class  words  by  their  classes. 
Then,  as  many  sentences  as  desired  could  be  generated.  We  generated  three  sentences 
from  each  pattern,  resulting  in  approximately  2850  sentences.  These  sentences  were  then 
sent  to  TI,  where  some  additional  changes  were  made  before  recording.  (Some  words 
that  were  too  hard  to  pronounce  were  replaced  with  other  words.)  The  procedures  that 
were  followed  in  creating  this  extensive  corpus  were  documented  in  an  ICASSP  paper, 
which  is  included  as  an  appendix  to  this  report. 

In order to be able to compare recognition results across different research sites
using  different  algorithms,  it  was  necessary  to  assure  that  all  sites  were  using  the  same 
grammatical  constraints.  Since  the  BYBLOS  system  was  the  first  one  within  the  program 
to  produce  reasonable  recognition  results,  we  inherited  the  task  of  specifying  and  trying 
standard  test  conditions.  We  documented  the  phonetic  dictionary  that  we  had  developed 
and made it available to other sites. We showed that it was not advantageous to have
a large number of phonetic pronunciations for each word, since this made the different
words more similar. Finally, we provided documentation for three standard test grammars:

1. No grammar - perplexity 1000

2.  Sentence  Pattern  Grammar  -  perplexity  10 

3.  Word-Pair  Grammar  -  perplexity  60 


The  first  grammar  used  for  testing  was  the  null  grammar  that  allowed  any  sequence 
of  words.  This  grammar  tested  the  basic  word  recognition  capabilities  of  the  systems. 
The  sentence  pattern  grammar  was  a  grammar  derived  from  the  patterns  used  to  generate 
the  sentences  in  the  Resource  Management  corpus.  Since  we  knew  that  this  grammar  was 
unrealistically  constrained,  we  created  another  grammar  that  allowed  all  pairs  of  words 
that  could  occur  anywhere  in  the  sentence  pattern  grammar.  This  increased  the  perplexity 
to 60, making it a much more reasonable grammar. By using these three grammars, it
was  possible  to  evaluate  the  word  recognition  capabilities  of  each  of  the  sites  at  different 
levels  of  difficulty. 
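
As an illustration of the word-pair idea, the sketch below collects the allowed word pairs from a list of example sentences. It is an assumption-laden simplification: the grammar described above was derived from the sentence patterns themselves, and the sentence-boundary tokens `<s>` and `</s>` are introduced here only for illustration.

```python
from collections import defaultdict

def build_word_pair_grammar(sentences):
    """Map each word to the set of words observed to follow it."""
    successors = defaultdict(set)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for left, right in zip(words, words[1:]):
            successors[left].add(right)
    return dict(successors)

def is_allowed(successors, sentence):
    """True if every adjacent word pair in the sentence is in the grammar."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    return all(right in successors.get(left, ())
               for left, right in zip(words, words[1:]))
```

A grammar of this kind accepts any sentence whose adjacent word pairs were all seen, including sentences that no original pattern generates, which is why its perplexity is higher than that of the sentence pattern grammar.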

One  way  of  estimating  the  difficulty  of  a  task  is  to  measure  the  average  number 
of  words  that  can  come  after  each  word  in  the  language  model  used  with  the  task.  The 
mathematical  quantity  that  we  use  for  this  is  called  perplexity.  While  this  measure  doesn't 
take  into  account  the  phonetic  similarity  between  words,  it  has  been  found  to  correlate 
well with word recognition error rate. We wrote a technical note to document the precise
techniques used to measure the perplexity of a language model on any particular test set
of  sentences.  That  note  is  included  as  an  appendix  to  this  report. 
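
For orientation, the standard definition of test-set perplexity can be sketched as below. This is the textbook formula, not necessarily the exact procedure of the technical note: the function name, the `cond_prob` callback, the `<s>` start token, and the choice not to count an end-of-sentence symbol are all assumptions.

```python
import math

def perplexity_on_test_set(sentences, cond_prob):
    """Perplexity of a word-pair style language model on a list of sentences.
    cond_prob(prev, word) must return P(word | prev); "<s>" marks sentence start."""
    log2_sum, num_words = 0.0, 0
    for sentence in sentences:
        prev = "<s>"
        for word in sentence.split():
            p = cond_prob(prev, word)
            if p <= 0.0:
                return math.inf   # the model forbids an observed word pair
            log2_sum += math.log2(p)
            num_words += 1
            prev = word
    if num_words == 0:
        raise ValueError("empty test set")
    # 2 raised to the average negative log2 probability per word
    return 2.0 ** (-log2_sum / num_words)
```

With no grammar and a 1000-word vocabulary, every word has probability 1/1000 and the function returns 1000 (up to rounding), matching the perplexity quoted for the null grammar.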

Finally, as the word recognition capabilities of the different systems have improved,
there has been a need for a grammar that is more difficult than the Word-Pair grammar.
Although recognition performance with the Word-Pair grammar is not yet high enough to be
useful, the number of errors in a reasonable-sized test set is too small to be measured with
statistical reliability. Furthermore, testing with no grammar is also too unrealistic, because it requires
many  distinctions  that  would  never  be  needed  in  a  system.  Therefore,  we  developed  a 
new  standard  test  grammar  based  on  a  first  order  statistical  model  of  word  classes.  The 
grammar,  which  is  based  on  only  100  word  classes,  was  designed  to  have  a  statistical 
perplexity  of  about  100,  which  will  result  in  about  twice  the  error  rate  associated  with  the 
word-pair  grammar.  We  estimate  that  the  difficulty  of  this  grammar  is  comparable  to  the 
difficulty  of  a  more  realistic  task  with  about  5000  words,  where  the  test  (actual)  sentences 
may  not  be  quite  so  similar  to  the  training  sentences.  We  documented  and  distributed  a 
set  of  programs  for  estimating  and  constructing  this  grammar  from  an  annotated  lexicon 
and  a  corpus  of  sentences  that  has  only  orthographic  transcriptions. 

9 Testing System Performance and Demonstrations


Throughout  the  project  we  have  performed  several  formal  and  informal  evaluations  of  the 
recognition  accuracy  of  the  BYBLOS  system.  The  early  tests  and  demonstrations  were 
informal,  since  BYBLOS  was  the  only  complete  system  at  that  time.  The  later  tests  were 
more  formal.  The  formal  tests  have  been  in  accordance  with  rules  agreed  upon  among 
the  different  research  sites,  together  with  NBS  (now  NIST). 

Our early work in context-dependent phonetic hidden Markov models for phonetic
recognition was incorporated into a word recognition system during 1985. The first test
of this system was on a small (334 words) electronic mail task. The domain consisted of
commands to an electronic mail system. 300 training sentences and 100 test sentences
were recorded from each of 3 male speakers. We compared the recognition accuracy with
context-dependent  phonetic  models  of  different  types  with  the  accuracy  when  context- 
independent  models  were  used.  The  recognition  experiments  were  run  first  with  no 
grammar,  and  later  with  a  finite-state  grammar  made  up  to  model  all  of  the  sentences. 
Averaged  over  the  3  speakers,  the  word  recognition  accuracy  with  context-independent 
models  was  76%.  When  context-dependent  models  were  used  the  accuracy  was  90%. 
When  a  grammar  with  perplexity  31  was  used,  the  recognition  accuracy  improved  from 
94%  to  98.2%.  This  represented  convincing  proof  of  the  viability  of  the  use  of  context- 
dependent  phonetic  models  and  of  the  use  of  HMM  models  in  general. 

During  the  next  several  months  of  1986  we  implemented  a  350-word  subset  of  the 
FCCBMP  Resource  Management  Task  Domain.  This  involved  recording  training  and  test 
sentences  for  the  new  domain  and  running  similar  tests  to  those  described  above.  The 
results  were  quite  similar.  Next,  we  added  300  new  words  to  the  test  vocabulary,  to  test 
the  effect  of  having  test  words  that  were  not  included  in  the  training.  Most  of  the  words 
added  were  additional  names  of  ships  and  ports.  We  found  that  the  recognition  results 
were  quite  similar  to  those  reported  earlier. 

During  the  spring  of  1986  we  implemented  a  demonstration  of  the  BYBLOS  system 
that  would  allow  a  user  to  speak  a  sentence  and  have  the  answer  appear  about  one  minute 
later. The system, which ran on a Symbolics Lisp machine, displayed its progress as it
attempted  to  recognize  what  was  said.  In  particular,  it  displayed  a  tree  of  the  most  likely 
sentence  hypotheses  that  were  under  consideration.  In  July,  1986  we  held  a  demonstration 
of  this  system  at  BBN. 

In  addition  to  using  speaker-dependent  models  derived  from  300  sentences  from 
one  speaker,  we  demonstrated  the  speaker  adaptation  capability  of  the  system.  Forty 
sentences  were  recorded  from  each  of  the  visitors.  These  sentences  were  used  to  transform 
a  speaker-dependent  model  from  one  speaker  so  that  it  could  be  used  for  the  new  speaker. 
While  the  recognition  accuracy  with  the  adapted  model  was  not  as  high  as  for  the  speaker- 
dependent  model,  it  was  still  quite  reasonable. 

At the end of 1986 we tested the BYBLOS system on the full 1000-word
vocabulary. The tests were run on two speakers from BBN, since the data being recorded
at TI was not yet available. The word recognition accuracy with no grammar was 87%.

The  speech  data  for  four  speakers  was  made  available  in  January  of  1987.  This 
was  the  first  formal  test  of  the  system  using  data  from  outside  BBN.  In  this  case  we  had 
600  sentences  (approximately  30  minutes  of  speech)  from  each  speaker.  We  ran  tests 
under  three  grammar  conditions: 

1.  Sentence  Pattern  Grammar  -  perplexity  10 

2.  Word-Pair  Grammar  -  perplexity  60 

3. No grammar - perplexity 1000

The  recognition  results  are  presented  for  each  of  the  grammars  and  for  the  four 
speakers  from  TI  as  well  as  the  two  speakers  from  BBN. 

              Sentence Pattern   Word Pair   No Grammar

TI-4spkrs          97.8%           89.9%        65%

BBN-2spkrs         99.8%           98.2%        87%

The  results  showed  clearly  that  the  recognition  accuracy  depends  significantly  on 
the  grammar  used.  They  also  indicated  that  the  speech  recorded  at  BBN  resulted  in 
much  higher  recognition  accuracy  than  that  recorded  at  TI.  This  was  presumably  due  - 
at  least  partially  -  to  the  speakers  at  BBN  speaking  more  carefully.  It  also  indicates  that 
a  significant  improvement  in  system  performance  can  be  achieved  by  a  certain  amount  of 
instruction  to  prospective  users  in  how  to  use  the  system.  This  improvement  is  at  least 
comparable  to  the  difference  in  performance  resulting  from  algorithm  improvements. 

On  July  27,  1987  we  gave  a  demonstration  that  showed  how  an  integrated  spoken 
language  system  could  be  used  for  the  FCCBMP  application.  A  graphics  system  that 
enabled  a  user  to  manipulate  objects  on  a  map  was  connected  to  the  natural  language 
system,  so  that  typed  commands  and  questions  would  be  answered  and  would  result  in 
appropriate  displays  on  the  map.  The  output  of  the  speech  recognition  was  then  connected 
to  the  natural  language  so  that  commands  and  questions  could  be  spoken.  While  the 
connection  between  the  speech  and  the  natural  language  was  serial,  it  illustrated  the 
power  that  such  a  spoken  language  system  would  have. 

One  of  the  requirements  for  the  October,  1987  meeting  was  that  some  of  the  results 
reported  would  be  from  a  "live  test”,  which  meant  that  the  speakers  were  speaking  directly 
to  the  system,  which  would  recognize  each  sentence  and  display  the  answer  before  they 
would speak the next sentence. On July 27, the three speakers who were to be in the
test came to BBN to provide training speech in order that we could compute speaker-dependent
models for the speakers. (The speakers were Alan Sears, David Pallett, and
Tice DeYoung.) Each speaker read training sentences during a total elapsed time that was
limited  to  one  hour.  The  recording  took  place  in  two  half-hour  sessions.  Afterwards,  we 
listened  to  all  of  the  recorded  sentences  and  deleted  those  where  the  words  spoken  were 
different  from  those  in  the  text  transcriptions.  On  the  average,  80%  of  the  utterances  were 
kept,  resulting  in  about  350  training  utterances  for  each  speaker,  or  about  18  minutes  of 
actual  speech.  On  September  29,  the  three  speakers  returned  to  test  the  system.  The 
word models for each of the speakers were transferred to the Butterfly computer which
performed the recognition. The grammar used was the Word-Pair Grammar. Each of
the speakers read 30 test sentences, one by one, and waited for the recognition answer
to be typed out. All input data and recognition results were also saved on files for later
analysis. On average, the recognition time was 10-40 seconds, or about 10 times real
time. In each case, the speaker was able to finish the entire session (including putting on
the microphone, comments, adjusting levels, and false starts) within 1/2 hour of elapsed
time. The word recognition error rates for the three speakers were AS: 4.4%, DP: 5%,
TD: 12%. These same sentences were also processed on the Lisp Machine simulation of
the  decoder  to  provide  recognition  results  with  no  grammar. 

In  addition  to  the  live  tests,  there  were  also  formal  tests  run  using  the  data  recorded 
at  TI.  In  this  case,  test  data  from  eight  of  the  speakers  was  evaluated.  Four  of  the  speakers 
had been used in the tests performed in March 1987. The speaker-dependent models
were generated using 570 of the 600 training sentences (we reserved 30 for in-house
testing). In August we received from NBS the set of 25 test sentences to be used for
testing. Again, tests were run using both the Word-Pair Grammar and no grammar. We
no  longer  used  the  Sentence  Pattern  Grammar,  since  it  was  judged  to  be  unrealistically 
easy.  The  results  were  consistent  with  those  obtained  in  March.  The  average  error  rates 
were  32%  with  no  grammar,  and  7.5%  with  the  Word-Pair  Grammar. 

During the October 1987 meeting we demonstrated the BYBLOS system in the
conference room where the meeting was held. There were several technical problems
related to getting the audio signal from this room back to our A/D facility, which was several
hundred yards away in a different building. The solution that was finally chosen was
to transmit the signal over unused telephone wires that went between the buildings. The
demonstrations included a near-real time demonstration of recognition on the Butterfly
Parallel Processing system. In this demonstration, the system displayed the hypothesized
word string as it processed the sentence. Frequently, the first two or three words were
displayed before the speaker finished speaking the sentence. Several speakers were used
during this demonstration.

In April 1988 we tested the new speaker adaptation algorithms developed under
the Basic Research effort on the Resource Management Database, and the data collected
"live" at BBN. We found that, on the average, the performance of the system when
models were adapted using two minutes of speech from the new speaker was equal to
that derived when 18 minutes of speech from the new speaker was used with the
speaker-dependent training algorithm. This is a significant improvement over our previous speaker
adaptation  algorithm,  which  resulted  in  performance  equivalent  to  about  8  minutes  of 
speaker-dependent  training. 

During  early  1988,  we  redesigned  and  implemented  the  BYBLOS  system  on  the 
SUN4  workstations.  Previously,  on  the  Symbolics  Lisp  machine,  the  limitations  of  slow 
computation  and  limited  virtual  address  space  made  it  difficult  to  run  many  experiments. 
In  particular,  we  were  not  able  to  use  multiple  sets  of  spectral  parameters  in  the  training 
and  recognition.  The  primary  difference  in  the  SUN4  version  was  that  we  now  used  the 
derivatives  of  the  cepstral  parameters  in  addition  to  the  cepstral  parameters  themselves. 
In  addition,  it  was  now  possible  to  run  many  experiments  in  order  to  tune  various  system 
parameters.  All  tuning  was  done  on  parts  of  the  training  set  or  on  the  October,  1987  test 
set.  We  then  ran  the  May  1988  test  data  for  all  12  speakers  through  the  decoder  using  the 
word-pair  grammar  and  the  null  grammar;  the  word  error  rate  was  now  3.4%  and  16.2% 
respectively.  This  represented  a  word  error  rate  reduction  by  a  factor  of  two  relative  to 
the  previous  system. 

Shortly after running these tests, we completed our work on smoothing the probability
distributions of the HMMs. When we received the test data for the February
1989 meeting, we decided to run the experiments both with and without the smoothing
algorithm. At the meeting in February we presented the effect of smoothing on both the
May '88 and February '89 test sets. The word error rates are given below.

                Word-Pair               No Grammar

            Control   Smoothing     Control   Smoothing

May '88       3.4        2.7          16.2       15.2

Feb '89       2.9        3.1          15.3       13.8

This  showed  that,  although  the  smoothing  algorithm  helped  in  most  cases,  it  did 
not always improve the performance. However, the overall results were the best reported
to date by any research site on this corpus.

10 Speaker Adaptation


A  key  area  of  our  work  has  been  in  rapid  speaker  adaptation.  Most  of  the  systems 
reported  to  date  are  based  on  two  training  paradigms.  In  the  speaker-dependent  paradigm, 
a  relatively  large  amount  of  speech  from  the  user  is  collected,  in  order  to  estimate  a  very 
accurate  model  of  how  that  speaker  speaks.  The  result  is  very  high  recognition  accuracy. 
In  the  speaker-independent  paradigm,  speech  is  collected  from  a  very  large  number  of 
speakers  (at  least  100  speakers).  This  speech  is  pooled  as  if  it  all  came  from  a  single 
speaker, and is used with exactly the same algorithm as in the speaker-dependent case.
The  result  is  that  when  a  new  speaker  speaks  to  the  system,  the  recognition  accuracy 
is  not  degenerately  bad,  but  the  word  error  rate  is  still  about  a  factor  of  2  to  3  times 
that  in  the  well-trained  speaker-dependent  paradigm.  In  addition,  for  many  applications, 
it  would  be  impractical  to  collect  speech  from  a  large  number  of  speakers  just  for  that 
application. 

The  speaker  adaptation  paradigm  provides  an  alternative  to  these  two  approaches. 
The  algorithm,  which  is  quite  different  from  the  basic  speaker-dependent/independent 
algorithm,  starts  with  a  well-trained  model  from  a  single  reference  speaker.  This  speaker 
is  presumably  one  who  has  trained  the  system  in  the  speaker-dependent  paradigm.  Then, 
a  small  amount  of  speech  is  collected  from  the  new  (target)  speaker.  This  speech  is 
used  to  transform  all  of  the  models  of  the  reference  speaker  so  that  they  are  appropriate 
for  the  target  speaker.  The  resulting  system  performance  is  somewhat  better  than  that 
in  the  speaker-independent  paradigm,  but  somewhat  worse  than  the  speaker-dependent 
paradigm,  at  a  small  fraction  of  the  cost  of  data  collection.  This  paradigm  has  the 
additional  advantage  that  it  will  be  natural  for  the  user  to  adapt  the  system  whenever  the 
acoustic  environment  or  his  voice  should  change  for  any  reason. 

We  have  investigated  several  new  algorithms  for  rapid  speaker  adaptation.  The 
major contribution of this effort has been a probabilistic spectral mapping algorithm that
transforms the reference speaker model into a target speaker model. We have experimented
with several algorithms for estimating this mapping, and have presented recognition
performance results at several of the project meetings.
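
One simple way to picture a probabilistic spectral mapping for discrete HMMs is sketched below. It is an illustrative assumption, not the algorithm as implemented: the mapping is taken to be a matrix of probabilities P(target code | reference code), estimated here from already-aligned pairs of vector-quantizer codes, and each reference output distribution is transformed by that matrix. The function names and the `floor` parameter are hypothetical.

```python
def estimate_mapping(aligned_pairs, num_codes, floor=1e-3):
    """Estimate mapping[j][k] = P(target code k | reference code j) from
    (reference_code, target_code) pairs that are assumed to be time-aligned."""
    counts = [[floor] * num_codes for _ in range(num_codes)]
    for ref_code, tgt_code in aligned_pairs:
        counts[ref_code][tgt_code] += 1.0
    # Normalize each row so that it is a probability distribution.
    return [[c / sum(row) for c in row] for row in counts]

def adapt_distribution(ref_pdf, mapping):
    """Transform one discrete output distribution of the reference speaker:
    target_pdf[k] = sum over j of ref_pdf[j] * mapping[j][k]."""
    num_codes = len(mapping[0])
    return [sum(ref_pdf[j] * mapping[j][k] for j in range(len(ref_pdf)))
            for k in range(num_codes)]
```

Applying `adapt_distribution` to every state of the reference model would yield a model for the target speaker. How the aligned code pairs are obtained from the small amount of target speech is outside the scope of this sketch.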

This  area  remains  one  of  our  key  research  areas,  since  we  feel  that  ultimately  the 
rapid  adaptation  paradigm  will  be  the  most  practical  for  new  users.  The  system  will 
begin  by  prompting  a  new  user  to  read  a  small  number  of  sentences.  Then,  as  the  user 
speaks  to  the  system,  it  will  incrementally  improve  the  performance  until  it  eventually 
becomes  a  high-performing  speaker-dependent  system. 

Appendix

This  appendix  contains  a  number  of  papers  that  have  been  written  under  this  contract. 
Below  is  the  list  of  papers  included. 
ner, J. Makhoul, Recognition performance and grammatical constraints, Proc. DARPA
Workshop  on  Continous  Speech  Recognition,  Palo  Alto,  CA,  February  1986. 
Price, S. Roucos, and R.M. Schwartz, BYBLOS: The BBN continuous speech recognition
system,  IEEE  International  Conference  on  Acoustics,  Speech,  and  Signal  Processing, 
Dallas,  Texas,  pp. 89-93.  April  1987. 

ABSTRACT

We describe the integration of grammatical with
acoustic knowledge sources in the BBN continuous word
recognition system, and the resulting effects on
performance. This combination decreases the total
number of insertions, deletions and substitutions by a
factor of more than 6 compared to the system with no
grammatical constraints, and yields a word accuracy of
better than 98%. We show that constraining the set of
possible word sequences can improve performance, even
when the amount of training per lexical item remains
fixed. In addition, we address the issues of estimating
from limited data the degree of constraint imposed by a
grammar and the importance of incorporating acoustic
similarity in such measures.


1  INTRODUCTION 

In this report we describe the development and use
of various finite state grammars in the BBN continuous
speech recognition system. In particular, we investigate
the relationship between recognition performance and
the degree of constraint imposed by a grammar. We feel
that understanding such relationships is crucial to
evaluating how well specific techniques of linguistic
modeling can be generalized to larger and more complex
tasks.

It is well known that recognition performance improves as vocabulary size
decreases. Similarly, when syntactic and semantic information are used to
reduce the number of words that can legally follow a given sequence of
words, a recognizer is expected to make fewer errors. Two related measures
of this type of grammatical constraint are perplexity and branching factor;
decreasing these characteristics of a grammar should lead to improved
performance. We shall discuss how these measures can be estimated when only
a small set of representative sentences is available.

In the following section we describe our recognition system. In section 3,
we describe a set of experiments designed to demonstrate the relationship
of performance to branching factor when the amount of training per item
remains constant. We then address the issue of estimating degree of
grammatical constraint from limited data (section 4). In section 5 we
describe the incorporation of various grammars in our recognition system
and the resulting effects on performance.


2  THE  SPEECH  RECOGNITION 
SYSTEM 

The speech recognition system consists of a feature extraction stage, an
acoustic scoring stage and a linguistic scoring stage. The feature
extraction stage computes the short-time spectral envelope every
centisecond and represents it by 14 Mel-warped cepstral coefficients. A
vector quantizer discretizes the spectral envelope to one of 256 spectral
templates using Euclidean distance. The sequence of discrete spectra is
used to compute the likelihoods of all possible hypotheses in the acoustic
and linguistic scoring modules. Recognizing an input utterance involves
finding the sequence of words $w_1 \ldots w_n$ that maximizes

$$P(x_1 x_2 \ldots x_N \mid w_1 \ldots w_n)\, P(w_1 \ldots w_n)$$

where $x_1 \ldots x_N$ is the sequence of quantized spectra and
$w_1 \ldots w_n$ is a sequence of words. The first term, the acoustic
score, is derived from a hidden Markov model (HMM) for each word. The
second term, the linguistic score, is, in principle, a model of the
expected syntax and semantics. This term includes a model of duration
(longer sequences are less likely), and a grammatical score. At present,
due to limited data, the grammatical score is simply set to 1 for sentences
allowed by the grammar and to 0 otherwise.
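The vector quantization step described above can be sketched as a
nearest-template search. This is a minimal illustration only: the function
name `quantize`, the two-dimensional toy codebook and the toy frames are
assumptions made for the example, whereas the system described here uses
14 coefficients per frame and 256 templates.

```python
import math


def quantize(frame, codebook):
    """Return the index of the template closest to `frame` (Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, template in enumerate(codebook):
        # Squared distance gives the same ranking as the distance itself.
        dist = sum((f - t) ** 2 for f, t in zip(frame, template))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Hypothetical toy codebook (3 templates) and 3 feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]
labels = [quantize(frame, codebook) for frame in frames]
print(labels)
```

Each frame is replaced by a single template index, so the utterance becomes
a sequence of discrete symbols that the acoustic scoring stage can consume.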

The dictionary used was developed and made available to us by the speech
group at Carnegie-Mellon University. We expanded it (from about 200 words)
to 334 words in order to fill out categories that were represented in the
original version. In particular, our version includes all months, all days
of the week, possessives for all proper nouns and plurals for all other
nouns, and cardinals and ordinals to cover numbers up to 999.

The training for our system was on 300 sentences (about 15 minutes) for
each talker. These sentences were syntactically and lexically based on 100
example sentences also provided by CMU. We reserved the set of 100
sentences for testing. The sentences were designed to be representative of
human-machine interaction in an electronic mail task, referred to as the
Email task.

Our word models are phonetically based and capture the acoustic
coarticulatory effects within a word to the extent that they can be
estimated reliably from available training data. In short, to obtain robust
estimates of the transition and output distributions of the HMM for a
phoneme-in-context, we use a weighted average of the parameters of models
with varying amounts of context. The details of these word models are
discussed in [2].

The linguistic model, which computes the a priori probability of a word
sequence, uses one of two types of models for the language. The first model
has no grammar and allows any word sequence. In this case, the probability
of a word sequence is determined by its length:

$$P(w_1 \ldots w_k) = c\, a^{-k}$$

where $a$ is just an insertion penalty that is chosen empirically to
control the insertion rate of the recognizer output and $c$ is a
normalizing constant. The second language model is a finite state
automaton. We describe in a later section how we generated the finite state
grammars from a small corpus of sentences. At present, sentences are either
accepted or rejected as grammatical depending on whether the automaton
parses them or not. Given sufficient data to determine the likelihood of
different word sequences, the paths of the automaton could be modified to
impose probabilities on sentences of the grammar.
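The two language models described above can be sketched in the log domain
as follows. This is an illustration under stated assumptions, not the
BYBLOS implementation: the function names, the deterministic transition
table keyed by (state, word), and the toy Email-style grammar are all
hypothetical.

```python
import math


def null_grammar_log_prob(num_words, insertion_penalty, log_c=0.0):
    """log of c * a**(-k): every word sequence is allowed, longer ones cost more."""
    return log_c - num_words * math.log(insertion_penalty)


def accepts(transitions, start, finals, words):
    """Return True if the deterministic automaton parses the word sequence."""
    state = start
    for word in words:
        if (state, word) not in transitions:
            return False
        state = transitions[(state, word)]
    return state in finals


def grammar_log_prob(transitions, start, finals, words):
    """Score 1 (log 0.0) for accepted sentences and 0 (log -inf) otherwise."""
    return 0.0 if accepts(transitions, start, finals, words) else -math.inf


# Hypothetical toy grammar: "show mail" and "delete the message".
toy = {(0, "show"): 1, (1, "mail"): 9,
       (0, "delete"): 2, (2, "the"): 3, (3, "message"): 9}
print(grammar_log_prob(toy, 0, {9}, ["show", "mail"]))
print(grammar_log_prob(toy, 0, {9}, ["mail", "show"]))
print(null_grammar_log_prob(3, insertion_penalty=2.0))
```

Working with logarithms keeps the accept/reject score additive with the
acoustic log-likelihood, so a rejected sentence can never win the search.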


3  RECOGNITION  ACCURACY  AND 
BRANCHING  FACTOR 

It is well known that recognition performance improves with smaller
vocabulary size, with or without grammatical constraints. The improved
performance may stem from two factors: (1) the smaller set of elements that
need to be distinguished, and (2) the greater amount of training that can
be devoted to each of the items. As vocabulary size increases, comparable
training becomes more difficult. Since our goals involve increasing
vocabulary size, we felt it was important to establish that the first of
the above factors alone, i.e., smaller vocabulary size (which can be
simulated by using a grammar), is sufficient to improve performance without
increasing the amount of training per lexical item. Further, we would like
to investigate the relationship between performance and constraints such as
vocabulary size or grammaticality. A set of experiments was designed to
simulate the effect of grammatical constraints over a range of branching
factors. This was done by restricting the set of lexical items to the words
appearing in a given test sentence plus additional words selected randomly
from the dictionary until the total number of words is equal to the desired
branching factor.
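The vocabulary restriction described above can be sketched as follows. The
function name, the toy dictionary and the use of a seeded random generator
are assumptions made for illustration; the original sampling procedure is
not specified beyond the description in the text.

```python
import random


def simulated_vocabulary(sentence_words, dictionary, branching_factor, rng):
    """Words of the test sentence plus random fillers, up to the branching factor."""
    vocabulary = set(sentence_words)
    if len(vocabulary) > branching_factor:
        raise ValueError("sentence has more distinct words than the branching factor")
    # Sort the candidates so that a seeded generator gives repeatable draws.
    candidates = sorted(set(dictionary) - vocabulary)
    fillers = rng.sample(candidates, branching_factor - len(vocabulary))
    return vocabulary | set(fillers)


# Hypothetical toy dictionary and test sentence.
dictionary = ["show", "the", "mail", "delete", "message", "from",
              "monday", "send", "reply", "first"]
rng = random.Random(0)
vocab = simulated_vocabulary(["show", "the", "mail"], dictionary, 5, rng)
print(sorted(vocab))
```

Repeating the draw with different seeds corresponds to the repetitions of
the experiment at each branching factor.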


3.1  Methodology 

We investigated branching factors of 10, 20, 50, 100, 200, and 334. The
last figure includes the entire dictionary. Performance was assessed for
the task of recognizing 30 of the 100 test sentences, described earlier, as
produced by three male talkers. Since we had previously made changes in our
system based on recognition of these 30 sentences, we repeated the
experiment for the smallest and largest branching factors on the 70
previously unused sentences. Since performance at these points for the new
sentences did not differ greatly from the results based on the 30 sentences
(performance was actually about 1% better on the new sentences), we present
the results based on the 30 sentences.

In order to achieve comparable statistical significance across the tests at
various branching factors (BF), that is, to adequately sample the
dictionary for each, we increased the number of repetitions for experiments
at lower BF. BF of 10 was repeated at least
10  times  per  talker  per  sentence.  BF  20  til  times i  BF 
50 (6 times), BF 100 (3 times), BF 200 (twice) and BF 334 (once).
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Figure  1:  Performance  and  Branching  Factor. 

Plotted is the word error rate, (substitutions + deletions) divided by the
total number of words in the test sentences, averaged across 3 male
speakers, as a function of branching factor.


3.2 Results and Discussion

Figure 1 shows the error rate averaged across the 3 talkers' productions of
the 30 test sentences. Performance is plotted as a function of branching
factor on a log-log scale. It is seen that performance increases (linearly
on this scale) with smaller branching factors: word accuracy improves from
about 90% for the full dictionary to about 96.5% for the branching factor
of 10. As mentioned earlier, performance on the remaining 70 test sentences
was about 1% better for branching factor of 10. The repetitions of the
experiment allow us to sample the effects of various choices of vocabulary
items, but not the effects of variability in articulation. In fact, our
entire set of errors for the branching factor of 10 corresponds to one or
two words produced by each talker. Given this distribution of errors and
the difference between the percentage of errors on the two sets of
sentences, we conclude that 30 test sentences (167 words per talker) are
not sufficient to reliably estimate performance in this case. The
experiment has, however, confirmed our hypothesis that reduction of the
number of allowable words is sufficient to improve performance without
increasing training, and we feel that the methodology may prove useful for
estimating the performance of a recognition algorithm on tasks differing in
the complexity required of the grammar. In order to quantify this
complexity, we present several methods for estimating the amount of
constraint imposed by a grammar.


4  ESTIMATING  GRAMMATICAL 
CONSTRAINT 


When recognition is performed without a grammar, the set of possible
outcomes is the set of all possible combinations of the lexical items. The
role of a grammar is to disallow some of those combinations. This means
that at any point the grammar has to choose not from the entire set of
lexical items, but from a smaller set. By reducing the legal possibilities,
the grammar imposes a constraint which makes the recognizer's task easier.
How does one measure the constraint imposed by the grammar? One would like
to average the number of choices at various points and weight them
according to how likely they are to occur. Such a measure, based on the
information theoretic concept of entropy, exists and is called "perplexity"
[1]. For a deterministic finite state automaton we define its entropy, $H$,
by

$$H = \sum_i p(i)\, h(i)$$

where $p(i)$ is the probability of node $i$, and $h(i)$ is the entropy of
the set of choices emanating from that node. The perplexity, $Q$, is

$$Q = e^{H}$$
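The entropy and perplexity definitions above can be sketched numerically as
follows. The function names and the two-node toy automaton are assumptions
for illustration; entropies are taken in nats so that the perplexity is the
exponential of the entropy.

```python
import math


def node_entropy(branch_probs):
    """Entropy (in nats) of the set of choices leaving one node."""
    return -sum(q * math.log(q) for q in branch_probs if q > 0.0)


def perplexity(node_probs, branch_probs):
    """Q = exp(H), with H the node-probability-weighted sum of node entropies."""
    entropy = sum(p * node_entropy(branch_probs[node])
                  for node, p in node_probs.items())
    return math.exp(entropy)


# Hypothetical automaton: node "A" offers 4 equally likely words,
# node "B" offers a single word, and both nodes are equally probable.
node_probs = {"A": 0.5, "B": 0.5}
branch_probs = {"A": [0.25, 0.25, 0.25, 0.25], "B": [1.0]}
q = perplexity(node_probs, branch_probs)
print(q)
```

In this toy case H = 0.5 log 4 = log 2, so the perplexity is 2: on average
the recognizer faces the equivalent of two equally likely choices.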


The perplexity of a grammar is determined by the network connectivity and
the probability assignment of the different transitions. In our case, the
network connectivity is determined by the types of linguistic phenomena
captured in a particular grammar. The probability assignment of the
transitions is, however, more difficult. The basis for our grammar was a
set of 100 sentences intended to represent rather than to define the
language. In fact, many different grammars can be built to cover all or
most of these sentences while differing greatly in the number and type of
additional sentences covered, and, more importantly, differing in their
perplexity. The problem now becomes the estimation of perplexity given a
set of "representative" sentences. We propose three methods. The first is
the maximum perplexity of a finite language [6], which is obtained by
solving for the positive root $x_0$ of

$$\sum_{k=1}^{k_{\max}} N_k\, x^{-k} = 1$$

where $N_k$ is the number of sentences of length $k$ in the language,
$k_{\max}$ is the length of the longest sentence in the language, and $x_0$
is the desired maximum perplexity.

A second measure, which we will call the uniform branching estimate of
perplexity, is obtained by assuming all transitions from a node in the
grammar to be equally likely.

The third measure, called test set branching factor, uses the set of test
sentences to estimate the average branching factor encountered by
traversing the FS network along the paths corresponding to each sentence.
We use the geometric mean of the number of branches at each node over all
the test sentences as an estimate of task perplexity.

All the above measures ignore the acoustic similarity of the words, an
important factor. Measures including this factor have been proposed; see,
for example, [3].


5  RECOGNITION  ACCURACY  AND 
GRAMMATICAL  CONSTRAINTS 

In this section, we compare recognition performance using grammars
differing in the degree to which they constrain the set of allowable word
sequences. We began with a grammar designed to cover a structural subset of
the Email sentences, the commands. A goal of this grammar was to maximize
coverage of these sentences plus logical extensions suited to the Email
task environment. Equally important in the design of this grammar was the
minimization of "over-generation", i.e., the generation or acceptance of
many ungrammatical sentences.

Our interest in grammars is broader than simply improving performance on a
given task. In addition, we would like to investigate the trade-off in
performance versus over-generation, and to estimate performance on more
difficult tasks, i.e., tasks requiring a larger number of choices at
various points in the grammar. We therefore designed a second grammar for
the commands, a grammar with greater perplexity. Similarly, we designed two
grammars differing in perplexity for the entire set of sentences (commands
as well as questions).


5.1  Integration  of  Grammatical  Constraints  in  the 
Recognition  System 

We approached the implementation of a grammar in our recognition system in
two steps. First, we created a description of the Email task language in a
modified context-free notation. This description was based on the 100
sentences mentioned earlier, and was designed to capture generalizations of
the linguistic phenomena found in them. Second, we created tools that
transformed this description into structures in our recognizer that provide
the corresponding grammatical constraint. These tools provide us with a
general facility for capturing in our recognition system an approximation
of any language expressible in context-free rules. We chose to implement
the constraints in the recognition system in the form of a finite automaton
(FA) similar to those described in [4] and [1].

At the first stage in generating a grammar, we use a context-free notation
augmented with variables in order to simplify the process of describing a
language. For example, this notation would allow a rule that says a noun
phrase of any number can be replaced by an article and a noun of the same
number, whereas ordinary context-free notation would require two rules that
are identical except that one would be for singular number and the other
for plural.
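The expansion of a rule with variables into ordinary context-free rules can
be sketched as follows. The template syntax (a variable written in braces
inside a symbol name), the function name and the symbol names are all
hypothetical; the actual augmented notation is not given in the text.

```python
from itertools import product


def expand_rule(lhs, rhs, variables):
    """Expand one rule template into ordinary context-free rules.

    A symbol such as "NP_{num}" carries the variable `num`; `variables`
    maps each variable name to its list of possible values.
    """
    symbols = [lhs] + rhs
    used = sorted(name for name in variables
                  if any("{" + name + "}" in symbol for symbol in symbols))
    rules = []
    for values in product(*(variables[name] for name in used)):
        binding = dict(zip(used, values))
        rules.append((lhs.format(**binding),
                      [symbol.format(**binding) for symbol in rhs]))
    return rules


# The noun phrase example from the text: article and noun agree in number.
rules = expand_rule("NP_{num}", ["ART_{num}", "N_{num}"], {"num": ["sg", "pl"]})
for left, right in rules:
    print(left, "->", " ".join(right))
```

One template yields one ordinary rule per combination of variable values,
so agreement is enforced without writing each case by hand.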

Our system first translates the augmented notation into ordinary
context-free rules and then constructs an FA based on these rules. While it
is true that context-free grammars can accept recursive languages which
finite automata cannot, finite automata can approximate recursion by
setting upper limits on the number of levels of recursion allowed. Such an
approximation is reasonable for most task languages, since spoken sentences
do not ordinarily use more than a few levels of recursion.
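The effect of limiting recursion depth can be illustrated by enumerating
the sentences a recursive grammar generates within a depth bound. This is
only a sketch of the idea: it lists the finite language directly rather
than building an automaton, and the function name, rule format and toy
grammar are assumptions. Any finite set of sentences can be accepted by a
finite automaton, which is why the bound makes the approximation possible.

```python
def expand(symbol, rules, depth):
    """All terminal strings derivable from `symbol` within `depth` rule levels.

    `rules` maps each nonterminal to a list of right-hand sides; any symbol
    that is not a key of `rules` is treated as a terminal word.
    """
    if symbol not in rules:
        return [[symbol]]
    if depth == 0:
        return []  # recursion limit reached: this derivation is cut off
    results = []
    for rhs in rules[symbol]:
        partials = [[]]
        for part in rhs:
            expansions = expand(part, rules, depth - 1)
            partials = [p + e for p in partials for e in expansions]
        results.extend(partials)
    return results


# Hypothetical recursive rule: NP -> message | message about NP
grammar = {"NP": [["message"], ["message", "about", "NP"]]}
for sentence in expand("NP", grammar, 3):
    print(" ".join(sentence))
```

Raising the depth bound adds one more level of embedding to the covered
language, at the cost of a larger network.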

In our recognition system, the automaton is used as follows. Associated
with each transition in the FA is a hidden-Markov word model that is used
to compute the probability of a spectral sequence given the occurrence of
the word at that place in the grammar. The recognition algorithm with this
grammar is only slightly different from the version of the algorithm that
allows any sequence of words [2]. For each 10 ms frame of the input speech,
the scores for all the word models in the FA network are updated according
to a modified Baum-Welch algorithm. The score for the start state of the FA
is unity and the score for every other FA state is simply the maximum of
all the word model scores that enter the state along FA transitions. This
state score, in turn, is propagated to the beginning of all the word models
on transitions leaving the state, to be used as the new initial score for
those models. In this way the recognizer only considers grammatical
sequences of words. Maintained throughout this scoring process are
traceback pointers that indicate for each state and each time the word
model that produced the best score to enter the state. Once an utterance is
thus processed, it is a simple matter to follow these pointers back through
the network to find the highest scoring sequence of words.
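The propagation of state scores through the FA can be sketched with a
strongly simplified decoder. The simplifications are assumptions made to
keep the example short and are not the system described here: each word
model is reduced to a single state with a given per-frame log score, scores
within a word are combined by maximization rather than by the modified
Baum-Welch update, and each path carries its word history instead of
traceback pointers. All names are hypothetical.

```python
import math


def decode(arcs, start, finals, frame_scores):
    """Frame-synchronous search over an FA whose arcs are (src, word, dst).

    `frame_scores` is a list (one entry per frame) of {word: log score}.
    Returns the best word sequence ending in a final state and its log score.
    """
    neg_inf = -math.inf
    states = {start} | set(finals) | {a[0] for a in arcs} | {a[2] for a in arcs}
    state_score = {s: neg_inf for s in states}
    state_words = {s: [] for s in states}
    state_score[start] = 0.0  # log of unity for the start state
    arc_score = [neg_inf] * len(arcs)
    arc_words = [[] for _ in arcs]
    for scores in frame_scores:
        # Enter each word from its source state if that beats staying in it.
        for i, (src, word, _dst) in enumerate(arcs):
            if state_score[src] > arc_score[i]:
                arc_score[i] = state_score[src]
                arc_words[i] = state_words[src] + [word]
            arc_score[i] += scores[word]
        # Each state takes the maximum of the word scores entering it.
        new_score = {s: neg_inf for s in states}
        new_words = {s: [] for s in states}
        for i, (_src, _word, dst) in enumerate(arcs):
            if arc_score[i] > new_score[dst]:
                new_score[dst] = arc_score[i]
                new_words[dst] = arc_words[i]
        state_score, state_words = new_score, new_words
    best = max(finals, key=lambda s: state_score[s])
    return state_words[best], state_score[best]


# Hypothetical grammar: ("show" | "send") followed by "mail".
arcs = [(0, "show", 1), (0, "send", 1), (1, "mail", 2)]
frames = [
    {"show": -1.0, "send": -3.0, "mail": -5.0},
    {"show": -1.0, "send": -3.0, "mail": -4.0},
    {"show": -5.0, "send": -5.0, "mail": -1.0},
    {"show": -5.0, "send": -5.0, "mail": -1.0},
]
words, score = decode(arcs, 0, [2], frames)
print(words, score)
```

Because a word can only be entered from the state score of its source
state, ungrammatical word sequences never receive a finite score.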

One potential difficulty with an FA grammar for recognition stems from the
fact that, ordinarily, computation is proportional to the number of
transitions in the FA. This number can become quite large for complex
languages. However, in our experience with grammars for the Email task, a
simple time-synchronous search with pruning [5] effectively reduces the
computation to less than that for the algorithm that does not use a
grammar, without affecting performance.


5.2  Description  of  the  Grammars  and  Methodology 

We compare here the effects on performance of grammars differing in which
set of sentences they are intended to cover (the full set of test sentences
or the commands only) and along a dimension we call tight-loose, which
refers to an estimate of how much over-generation is produced by the
grammar. "Tight" grammars have very little over-generation (generation of
sentences that are considered ungrammatical) and, because of these tighter
constraints, tend to have fewer choices at various points in the grammar,
i.e., smaller perplexity. "Loose" grammars, on the other hand, have a great
deal of over-generation and greater perplexity (larger sets of choices at
various states). The loose grammars developed here are loose in that, for
example, no number, tense, case or semantic agreement is required.

The grammars we have investigated so far include a tight and a loose
grammar for commands (COM-T and COM-L, respectively) and a loose grammar
that covers both commands and questions (SENT-L). In addition, we have used
another grammar that is tighter than SENT-L (and hence is called SENT-T),
but only in aspects that would otherwise put into similar grammatical
distribution large sets of minimal pairs. For example, singular versus
plural nouns, the cardinals versus ordinals, or verb tenses all involve
large sets of acoustically similar items. This fact can pose a problem for
recognition if the grammar allows many sequences in which one member of the
pair can be substituted for the other. On the other hand, distinguishing
verbs on the basis of which objects they take reduces perplexity without
necessarily reducing the number of acoustically similar competing words.

Table I shows the relevant attributes of the grammars investigated. For
comparison, the results for no grammar (the trivial grammar that allows any
lexical item to occur anywhere) are also included. The table includes the
number of arcs (a rough measure of size that is related to computation
time) and the three estimates of perplexity (Maximum Perplexity, Test Set
Branching Factor and Uniform Branching). This table also shows the number
of words and number of sentences on which each grammar was tested, and the
performance for each. Word accuracy here is computed as 100% minus the
error rate, where the error rate is the sum of all errors (insertions +
deletions + substitutions) divided by the sum (total words + insertions).
Sentence accuracy is also included in order to show that a few percentage
points difference in word accuracy can result in much larger differences in
the number of correctly recognized sentences, a number that is no doubt
very important to potential users.

Since we had used 30 of the 100 test sentences in previous experiments and
modified our system as a function of those results, we used only the subset
of 70 remaining sentences for the performance figures reported here. In
order to compare the tight and loose versions of the grammars, performance
was assessed using the intersection of the sentences parsed by each
grammar. Results are based on using the phone-left-and-right word model
discussed in [2].


5.3  Results  and  Discussion 

Figures 2a (commands only) and 2b (commands and questions) show graphically
the word accuracy figures of Table I associated with each grammar.
Performance is plotted as a function of the perplexity estimates used. As
can be seen, these grammars differ in their effects on performance.
Further, when two grammars that cover the same set of sentences are
compared (COM-T versus COM-L or SENT-T versus SENT-L), the more constrained
grammar has significantly better word accuracy than the less constrained
one: tightening of the command grammar improved performance from 95.5% to
98.4%; tightening of the sentence grammar improved performance from 96.2%
to 98.2%. Word accuracy again includes as errors all insertions, deletions
and substitutions. Further, it appears that grammatical constraints that
take into account acoustic similarity


TABLE I

Properties of the Grammars

GRAMMAR                 COM-L    COM-T    SENT-L   SENT-T   NONE
Number of arcs            836     7187      2547     3771      -
Maximum Perplexity         58       19        75       60    334
Test Set Branching         40       18        47       31    334
Uniform Branching          19        9        22       19    334
Words in test set         183      183       438      438    492
Sentences in test set      27       27        63       63     70
Sentence accuracy       72.9%    90.1%     80.5%    90.2%  36.7%
Word Accuracy           95.5%    98.4%     96.2%    98.2%  86.6%

Comparison of the various grammars used for the commands (tight coverage,
COM-T; loose coverage, COM-L) and the commands plus questions (tight
coverage, SENT-T; loose coverage, SENT-L). Word accuracy here is computed
as 100% minus (insertions + deletions + substitutions) divided by (total
words + insertions).


Figure 2: Performance with Grammars. Plotted is word accuracy, computed
from (insertions + deletions + substitutions) divided by (number of words +
insertions), as a function of perplexity as estimated by the uniform
branching assumption (X), the test set branching factor (squares), and the
maximum perplexity (circles). (a) The tightly constrained command grammar
(COM-T) and its loose counterpart (COM-L). (b) The tightly constrained
sentence grammar (SENT-T) and its loose counterpart (SENT-L), which
considers acoustic similarity.
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References 


improve performance more than those that do not for comparable estimated
perplexity: the SENT-L grammar improves performance more than its estimated
perplexity would predict if acoustic similarity had not been an important
factor.

An analysis of the recognition errors using these various grammars reveals
that, in general, acoustically similar items are confused. It does not
appear that function words are more often involved in the errors than
content words. A large percentage of our errors (32% for SENT-T) involve
"the" and "a", which happen to be function words. However, no other
function words show this pattern. We believe that "the" and "a" show up
more often in the errors NOT because they are function words, but because
they (1) are acoustically similar, (2) have similar grammatical
distributions, and (3) are very frequent words in these sentences. Assuming
that we cannot change their acoustic similarity or their lexical frequency,
improving performance on these words requires a more constrained
specification of their distribution in the linguistic model. It is possible
that semantic, pragmatic or discourse models could separate the two
distributions, given a well-defined task environment.


6  CONCLUSIONS  AND  FUTURE 
RESEARCH 

We have implemented and tested methods of combining grammatical and
acoustic knowledge sources in our recognition algorithm. We find that the
use of grammatical constraints can decrease the error rate by a factor of
more than six. This result corresponds to a word accuracy (counting all
insertions, substitutions and deletions as errors) of more than 98% for the
Email task. Reducing the number of words considered by the recognizer
boosts performance, even when the amount of training per word is fixed. We
have presented various estimates of grammatical perplexity and shown that
performance improves as estimated perplexity decreases for a given task.
Our experience with a grammar that focuses only on syntactic constraints in
acoustically confusable portions of the grammar demonstrates the importance
of acoustic similarity in predicting performance accurately and in
improving recognition performance.
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Abstract 


In this paper, we describe BYBLOS, the BBN continuous speech recognition
system. The system, designed for large vocabulary applications, integrates
acoustic, phonetic, lexical, and linguistic knowledge sources to achieve
high recognition performance. The basic approach, as described in previous
papers [1,2], makes extensive use of robust context-dependent models of
phonetic coarticulation using Hidden Markov Models (HMM). We describe the
components of the BYBLOS system, including: signal processing frontend,
dictionary, phonetic model training system, word model generator, grammar
and decoder. In recognition experiments, we demonstrate consistently high
word recognition performance on continuous speech across speakers, task
domains, and grammars of varying complexity. In speaker-dependent mode,
where 15 minutes of speech is required for training to a speaker, 98.5%
word accuracy has been achieved in continuous speech for a 350-word task,
using grammars with perplexity ranging from 30 to 60. With only 15 seconds
of training speech we demonstrate performance of 97% using a grammar.

1.  Introduction 

Speech  is  a  natural  and  convenient  form  of 
communication  between  man  and  machine.  The  speech  signal, 
however,  is  inherently  variable  and  highly  encoded.  Vast 
differences  occur  in  the  realizations  of  speech  units  related  to 
context,  style  of  speech,  dialect,  talker.  This  makes  the  task  of 
large  vocabulary  continuous  speech  recognition  (CSR)  by 
machine  a  very  difficult  one.  Fortunately,  speech  is  also 
structured  and  redundant:  information  about  the  linguistic 
content  in  the  speech  signal  is  often  present  at  the  various 
linguistic  levels.  To  achieve  acceptable  performance,  the 
recognition  system  must  be  able  to  exploit  the  redundancy 
inherent  in  the  speech  signal  by  bringing  multiple  sources  of 
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designing  a  large  and  complex  system  for  continuous  speech 
recognition.  This  paper  is  organized  as  follows  Section  2 
N  continuous  gives  an  overview  of  the  BYBLOS  system.  Section  3 

ted  for  large  describes  our  signal  processing  frontend.  Section  4  describes 

netic,  lexical,  the  trainer  system  used  for  phonetic  model  knowledge 

h  recognition  acquisition.  Section  5  describes  the  word  model  generator 

1  in  previous  module  that  compiles  word  HMMs  for  each  lexical  item 

:xt-dependent  Section  6  describes  the  syntactic/grammatical  knowledge 
den  Markov  source  that  operates  on  a  set  of  context-free  rules  describing 

the  BYBLOS  the  task  domain  to  produce  an  equivalent  finite  state  automaton 

1,  dictionary,  used  in  the  recognizer.  Section  7  describes  the  BYBLOS 

el  generator,  recognition  decoder  using  combined  multiple  sources  of 

eriments,  we  knowledge.  Finally,  Section  8  presents  some  figures  and 

performance  discussions  on  BYBLOS  recognition  performance. 


2.  Bvblos  System  Overview 

Figure  1  is  a  block  diagram  of  the  BYBLOS  continuous 
speech  recognition  system.  We  show  the  different  modules 
and  knowledge  sources  that  comprise  the  complete  system,  the 
arrows  indicating  the  flow  of  module/KS  interactions  The 
modules  are  represented  by  rectangular  boxes.  They  are, 
starting  from  the  top:  Trainer,  Word  Model  Generator,  and 
Decoder.  Also  shown  are  the  knowledge  sources,  which  are 
represented  by  the  ellipses.  They  include:  Acoustic-Phonetic, 
Lexical,  and  Grammatic  knowledge  sources.  We  will  describe 
briefly  the  various  modules  and  how  they  interact  with  the 
various  KSs- 

Acoustic-Phonetic  KS 

The  Trainer  module  is  used  for  the  acquisition  of  the 
acoustic-phonetic  knowledge  source.  It  takes  as  input  a 
dictionary  and  training  speech  and  text,  and  produces  a 
database  of  context-dependent  HMMs  of  phonemes. 

knowledge to bear. In general, these can include:
acoustic-phonetic, phonological, lexical, syntactic, semantic and
pragmatic knowledge sources (KS). In addition to designing
representations for these KSs, methodologies must be
developed for interfacing them and combining them into a
uniform structure. An effective and coherent search strategy
can then be applied based on global decision criteria. Practical
issues that need to be resolved include computation and
memory requirements, and how they could be traded off to
obtain the desired combination of speed and performance.

In BYBLOS, we have explored many issues that arise in

Lexical KS

The Word Model Generator module takes as input the
phonetic models database, and compiles word models from
phonetic models. It uses the dictionary (the lexical KS), in which
phonological rules of English are used to represent each lexical
item in terms of its most likely phonetic spellings. The
lexical KS imposes phonotactic constraints by allowing only
legal sequences of phonemes to be hypothesized in the
recognizer, which reduces the search space and improves
performance. The output of the Word Model Generator is a
database of word models used in the recognizer.


Grammatical  KS 

More recently, we have been working on representation
and integration of higher levels of knowledge sources into
BYBLOS, including both syntactic and semantic KSs. By
incorporating both of these KSs into BYBLOS in the form of a
grammar in our recognizer, we demonstrate improved
recognition performance. In Section 6, we describe the
Grammatical KS in more detail.


Figure 1: BYBLOS System Diagram.


3.  Signal  Processing  and  Analysis  Component 

The BYBLOS signal processing front end performs
feature extraction for the acoustic models used in recognition.
Sentences are read directly into a close-talking microphone in a
natural but deliberate style in a normal office environment.
The input speech is lowpass filtered at 10 kHz and sampled at
20 kHz. Fourteen Mel-frequency cepstral coefficients (MFCC)
are computed from short-term spectra every 10 ms using a 20
ms analysis window. This MFCC feature vector is then vector
quantized to an 8-bit (256 bins) representation. The vector
quantization (VQ) codebook is computed using the k-means
clustering algorithm with about 5 minutes of speech. We
perform a variable-frame-rate (VFR) compression in which
strings of up to 3 identical vector codes are compressed to a
single observation code. We found that this VFR procedure speeds
up computation with no loss in performance.
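
The VFR step can be illustrated with a minimal sketch. The function name
`vfr_compress` and the parameter `max_run` are our own labels, not names
from BYBLOS; the only behavior taken from the text is that a run of up to
3 identical VQ codes collapses to one observation code.

```python
def vfr_compress(codes, max_run=3):
    """Collapse runs of identical VQ codes.

    Each output code stands for at most `max_run` consecutive identical
    input frames (3 in the description above). Hypothetical helper, not
    the BYBLOS implementation.
    """
    out = []
    run_len = 0
    for code in codes:
        if out and code == out[-1] and run_len < max_run:
            run_len += 1      # absorb this frame into the current run
        else:
            out.append(code)  # start a new run with this frame
            run_len = 1
    return out
```

For example, a run of four identical codes produces two output codes: one
for the first three frames and one for the fourth.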


4.  Training/Acquisition  Of  Phonetic 
Coarticulation  Models 

The training system in BYBLOS acquires and estimates
the phonetic coarticulation models used in recognition. Given
that we model speech parameters as probabilistic functions of a
hidden Markov chain, we make use of the Baum-Welch (also
known as the Forward-Backward) algorithm [3] to estimate the
parameters of the HMMs automatically from speech
and the corresponding text transcription. For each training
utterance, the training system takes speech and text, and builds
a network of phonemes using the dictionary. It first builds the
phonetic network for each word by using the phonetic
transcription provided by the dictionary. The phonetic network
is expanded into a triphone network so that each arc
completely defines a phonetic context up to the triphone.
The triphone networks of the words are then concatenated to
form a single network for the sentence, which in general can
take into account within-word as well as across-word
phonological effects. The training system then compiles a set
of phonetic context models for each triphone arc in the
network. It then runs the forward-backward algorithm to
estimate the parameters of the phonetic context models. The
Trainer operates in two modes: speaker-dependent and
speaker-adapted. Associated with these two modes are two
distinct methods for training the parameters of the hidden
Markov models, described below.
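
As an illustration of the triphone expansion described above, the sketch
below turns a dictionary pronunciation into (left, phone, right) contexts
and concatenates words so that contexts cross word boundaries. The
function names, the "#" boundary symbol, and the toy dictionary are
assumptions made for this example; the real system builds a network
rather than a flat list.

```python
def to_triphones(phonemes, boundary="#"):
    """Return one (left, phone, right) triple per phoneme.

    `boundary` is an assumed symbol marking the utterance edges.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def sentence_triphones(words, dictionary):
    """Concatenate word pronunciations, then expand to triphones,
    so the context of a word-final phoneme is the next word's first
    phoneme (an across-word effect)."""
    phones = [p for word in words for p in dictionary[word]]
    return to_triphones(phones)


# Toy dictionary with made-up phonetic spellings.
toy_dictionary = {"show": ["sh", "ow"], "me": ["m", "iy"]}
contexts = sentence_triphones(["show", "me"], toy_dictionary)
```

Here the second context is ("sh", "ow", "m"): the right context of the
last phoneme of "show" comes from the following word.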

Speaker-Dependent 

This mode uses the Baum-Welch algorithm to find the
parameters of the HMMs that maximize the probability of the
observed data given the model. This method produces HMMs
that are finely tuned to a particular speaker and therefore, in
general, work well only for that speaker. Typically about 15
minutes of speech from a speaker is required for
speaker-dependent training.

Speaker-Adapted

This is a new method of training that transforms the HMMs
of one speaker to model the speech of a second speaker
[4]. This procedure estimates a probabilistic spectral mapping
from a well-trained prototype speaker to a new speaker. Using
this method, it is possible for a new speaker to use the system
with as little as 15 seconds of speech.
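
The text does not give the form of the probabilistic spectral mapping.
One simple way to picture it, offered here purely as an assumption, is a
row-stochastic matrix whose entry (i, j) is the probability that the new
speaker produces VQ code j where the prototype speaker produces code i.
Applying it to a discrete HMM output distribution then looks like this
(all names are hypothetical):

```python
def adapt_pmf(proto_pmf, mapping):
    """Map a prototype speaker's discrete output distribution to a
    new speaker's codebook.

    proto_pmf[i]  : P(prototype code i | HMM state)
    mapping[i][j] : assumed P(new-speaker code j | prototype code i)
    Returns q[j] = sum_i proto_pmf[i] * mapping[i][j].
    """
    n_new = len(mapping[0])
    return [sum(p * row[j] for p, row in zip(proto_pmf, mapping))
            for j in range(n_new)]


# Toy 2-code example; the numbers are illustrative only.
toy_mapping = [[0.8, 0.2],
               [0.3, 0.7]]
adapted = adapt_pmf([0.5, 0.5], toy_mapping)
```

Because each row of the mapping sums to one, the adapted distribution
also sums to one.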


5.  Word  Model  Generator 

Prior to recognition, word HMMs are computed for each
word in the vocabulary. The word model generator takes as
input two objects: a database of phonetic HMMs as obtained
in training, and a dictionary that contains phonetic spellings for
each word. For each phoneme in each word of the lexicon, it
first finds in the phonetic HMM database all the context
models that are relevant to this phoneme in its particular
phonetic environment. It then combines this set of phonetic
models with appropriate weights to produce a single HMM for
each phoneme in the word. This combination process saves
computation by precompiling the many levels of phonetic
context models that can occur for a given phonetic context into
a single representation. The output of the word model
generator is a database of word HMMs serving as the input to
the decoder.
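
The weighted combination of context models can be sketched as a linear
interpolation of discrete output distributions. How BYBLOS chooses the
weights is not stated here, so the weights below are placeholders, and
`combine_context_models` is a hypothetical name.

```python
def combine_context_models(pmfs, weights):
    """Interpolate several context-dependent output distributions
    (for example triphone, left-context, right-context, and
    context-independent models) into one distribution.

    pmfs    : list of equal-length probability lists over VQ codes
    weights : one non-negative weight per pmf, summing to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_codes = len(pmfs[0])
    return [sum(w * pmf[k] for w, pmf in zip(weights, pmfs))
            for k in range(n_codes)]


# Toy example: a detailed model and a general model, equal weights.
combined = combine_context_models([[0.5, 0.5], [1.0, 0.0]], [0.5, 0.5])
```

Doing this once per phoneme before recognition means the decoder
evaluates a single distribution per state instead of several.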

6. Grammatical Knowledge Source

Solving the continuous speech recognition (CSR) problem
requires major advances in two areas: acoustic modeling and
language modeling. A good acoustic model is essential for
making fine phonetic distinctions when needed. However, it is
not sufficient by itself to solve the CSR problem. In a complex
task with a large vocabulary, where the number of hypothesized
word candidates is large, the probability of acoustic confusion
can be high, and the recognizer can make errors. A
conceptually simple yet effective way to restrict the number of
words that are allowed to be hypothesized, and therefore to
decrease the probability of acoustic confusion, is to
incorporate a grammar into the recognizer. It is well known
that recognition performance improves as vocabulary size
decreases. Similarly, when syntactic information is used to
reduce the number of words that can legally follow a given
sequence of words, a recognizer is expected to make fewer
errors. The purpose of using a grammar, then, is to improve
recognition performance, with the added benefit of reduced
computation.

Grammar  Design  and  Implementation 

We approach the implementation of a grammar in
BYBLOS in two stages. First, we create a description of the
task domain language using a modified context-free notation.
Typically this description is based on a representative set of
sentences that characterizes the task domain, and is designed to
capture generalizations of the linguistic phenomena found in
them. Second, we use a tool that transforms this description
into structures in our recognizer that provide the corresponding
grammatical constraints. This tool provides us with a general
facility for capturing in BYBLOS an approximation of any
language expressible as a context-free grammar (CFG),
written as context-free rules. We elected to implement the
grammatical constraints in the form of a finite state automaton
(FA) similar to those described in [5].

At the first stage in generating a grammar, we use a
context-free notation augmented with variables in order to
simplify the process of describing a language. For example,
this notation would allow a rule that says a noun phrase of any
number can be replaced by an article and a noun of the same
number; ordinary context-free notation would require two rules
that are identical except that one would be for singular number
and the other for plural.
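
The noun-phrase example can be made concrete with a small sketch that
instantiates a rule containing a variable into ordinary context-free
rules. The bracket syntax `NP[num]` and the function name `expand_rule`
are invented for this illustration; the actual BYBLOS notation is not
shown in this report.

```python
from itertools import product


def expand_rule(lhs, rhs, variables):
    """Instantiate every variable in an augmented rule.

    lhs, rhs  : symbols such as 'NP[num]' (hypothetical syntax)
    variables : dict mapping a variable name to its possible values
    Returns one ordinary rule (lhs, rhs) per combination of values.
    """
    names = sorted(variables)
    rules = []
    for values in product(*(variables[n] for n in names)):
        binding = dict(zip(names, values))

        def instantiate(symbol):
            for name, value in binding.items():
                symbol = symbol.replace("[" + name + "]", "_" + value)
            return symbol

        rules.append((instantiate(lhs), [instantiate(s) for s in rhs]))
    return rules


np_rules = expand_rule("NP[num]", ["ART[num]", "N[num]"],
                       {"num": ["sg", "pl"]})
```

One augmented rule yields the two ordinary rules NP_sg -> ART_sg N_sg
and NP_pl -> ART_pl N_pl, which is the saving the text describes.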

Our  system  first  translates  the  augmented  notation  into 
ordinary  CFGs  and  then  constructs  a  FA  based  on  these  rules. 
Because  context-free  grammars  can  accept  recursive  languages 
and  a  FA  cannot,  recursion  is  approximated  in  the  FA  by 
limiting  the  number  of  levels  of  recursion.  Such  an 
approximation  is  reasonable  for  most  task  languages,  since 
spoken  sentences  do  not  ordinarily  use  more  than  a  few  levels 
of  recursion. 
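
To show why limiting recursion makes a finite-state representation
possible, the sketch below enumerates the sentences of a small recursive
CFG with a cap on how many times a nonterminal may be nested inside
itself. With the cap in place the language is finite, and any finite
language can be accepted by an FA. This enumerates strings rather than
building the automaton, and the grammar and names are illustrative
assumptions.

```python
from itertools import product


def expand(symbol, grammar, max_depth, active=()):
    """Return the set of word tuples derivable from `symbol`.

    grammar   : dict nonterminal -> list of right-hand sides
    max_depth : how many times a nonterminal may reappear inside
                its own expansion
    active    : nonterminals currently being expanded on this path
    """
    if symbol not in grammar:               # terminal word
        return {(symbol,)}
    if active.count(symbol) > max_depth:    # recursion limit reached
        return set()
    results = set()
    for rhs in grammar[symbol]:
        parts = [expand(s, grammar, max_depth, active + (symbol,))
                 for s in rhs]
        for combo in product(*parts):
            results.add(tuple(w for part in combo for w in part))
    return results


toy_grammar = {
    "S": [["show", "NP"]],
    "NP": [["ships"], ["ships", "near", "NP"]],   # recursive rule
}
one_level = expand("S", toy_grammar, max_depth=1)
```

With `max_depth=1` the toy grammar yields "show ships" and
"show ships near ships"; raising the limit adds longer sentences.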

7.  Recognition  Search  Strategy 

Once the FA is compiled from the context-free
description of the task domain, it is ready to be used in the
decoder. An important characteristic of a recognizer is the
search strategy that is used to find the word sequence that best
matches the input speech. We believe that an optimum search
strategy avoids making local decisions; the search decision
should be made globally, based on scores from all the KSs.
One such search paradigm is the one used in BYBLOS: the
search is made top down, linguistically driven, with tightly
coupled KSs.

The FA is convenient for deploying such a search
strategy. It is used as follows in our recognizer. We associate
with each transition in the FA a hidden Markov model for the
word. This model is used to compute the probability of the
acoustic event (sequence of VQ spectra) given the occurrence
of the word at that place in the grammar. Before the start of
recognition, the score of the initial state of the FA, where a
legal sequence of words can begin, is initialized to unity, and
the scores of all the other states are initialized to zero. For
each 10 ms frame of the input speech, the scores for the states
in all the words in the FA network are updated using a modified
Baum-Welch algorithm [2]. In addition to state updates within a
word, a word can have a score propagated to its initial state
from its best scoring predecessor word. This simple state
update operation is repeated every 10 ms for each FA transition
until the end of the utterance is reached. The decoder output is
then computed by tracing back through the FA network to find
the highest scoring sequence of words that ends in the terminal
state of the FA.
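
The frame-synchronous update can be sketched as follows. This is a
simplified stand-in, not the BYBLOS decoder: it uses Viterbi (maximum)
scoring in the log domain instead of the modified Baum-Welch update,
carries the word history with each score instead of tracing back, and
assumes left-to-right word models with one shared self-loop probability.
All names and the toy models are assumptions.

```python
import math

NEG_INF = float("-inf")


def decode(arcs, word_hmms, initial, final, observations, p_stay=0.5):
    """arcs: list of (src_state, dst_state, word) FA transitions.
    word_hmms: word -> list of {vq_code: probability}, one per HMM state.
    Returns (log score, word tuple) of the best path ending in `final`.
    """
    log_stay, log_next = math.log(p_stay), math.log(1.0 - p_stay)
    n_states = 1 + max(max(s, d) for s, d, _ in arcs)
    # Score and history of each FA state at a word boundary.
    boundary = [(NEG_INF, ())] * n_states
    boundary[initial] = (0.0, ())            # log(1) for the initial state
    tokens = [[(NEG_INF, ())] * len(word_hmms[w]) for _, _, w in arcs]
    for obs in observations:
        new_boundary = [(NEG_INF, ())] * n_states
        for a, (src, dst, word) in enumerate(arcs):
            old, new = tokens[a], []
            for k, emit in enumerate(word_hmms[word]):
                stay = (old[k][0] + log_stay, old[k][1])
                if k == 0:                   # enter from the best predecessor
                    enter = boundary[src]
                else:                        # advance within the word
                    enter = (old[k - 1][0] + log_next, old[k - 1][1])
                score, hist = max(stay, enter, key=lambda t: t[0])
                p = emit.get(obs, 0.0)
                if p > 0.0 and score > NEG_INF:
                    new.append((score + math.log(p), hist))
                else:
                    new.append((NEG_INF, ()))
            tokens[a] = new
            exit_score = new[-1][0] + log_next
            if exit_score > new_boundary[dst][0]:
                new_boundary[dst] = (exit_score, new[-1][1] + (word,))
        boundary = new_boundary
    return boundary[final]


# Toy grammar: exactly two words, each either "a" or "b".
toy_arcs = [(0, 1, "a"), (0, 1, "b"), (1, 2, "a"), (1, 2, "b")]
toy_hmms = {"a": [{0: 0.9, 1: 0.1}], "b": [{0: 0.1, 1: 0.9}]}
best_score, best_words = decode(toy_arcs, toy_hmms, 0, 2, [0, 0, 1, 1])
```

On the toy input the best word sequence is ("a", "b"), since code 0
favors "a" and code 1 favors "b".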

One  potential  problem  associated  with  using  a  FA 
grammar  for  recognition  is  that  computation  is  expected  to  be 
proportional  to  the  number  of  transitions  in  the  FA.  This 
number  can  be  quite  large  for  complex  languages.  However,  in 
our  experience  with  different  grammars  in  our  recognizer,  we 
find  that  a  beam  search  effectively  reduces  the  computation  to 
a  very  manageable  level  while  maintaining  the  same 
performance  as  that  of  an  exhaustive  search. 
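
A beam search of the kind mentioned above can be sketched as a pruning
step applied after each frame: hypotheses whose score falls too far below
the current best are dropped. The beam width and the function name are
illustrative assumptions; the report does not state the values used.

```python
def beam_prune(scores, beam_width):
    """Keep only hypotheses within `beam_width` of the best log score.

    scores : dict mapping a hypothesis key (for example an
             (arc, HMM state) pair) to its log score
    """
    if not scores:
        return {}
    best = max(scores.values())
    return {key: s for key, s in scores.items()
            if s >= best - beam_width}


active = beam_prune({"a": -1.0, "b": -3.0, "c": -12.0}, beam_width=5.0)
```

Only the surviving hypotheses are updated on the next frame, so the cost
per frame depends on the beam rather than on the full size of the FA.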

8. BYBLOS Recognition Performance

In [1], we presented word recognition results for a
334-word electronic mail task. In speaker-dependent mode, we
demonstrated performance of 90% across several speakers
without the use of a grammar (i.e., branching factor of 334).
Since then, we have tested the system along many dimensions:
two task domains, FA grammars with varying perplexities,
varying amounts of adaptation speech, and different speaker
types. The results are tabulated in Figure 2. Below we
describe the different conditions in more detail.

Task Domains

The two task domains tested are: Electronic Mail
(EMAIL) and Naval Database Retrieval (FCCBMP). Both
tasks have vocabulary sizes of approximately 350 words (334
for EMAIL, 354 for FCCBMP). A description of each task
domain language was created using a CFG. The CFGs were
designed to capture generalizations of linguistic phenomena
found in example task domain sentences.

Grammars 

Two finite state grammars were generated for each task
domain: Command and Sentence. The Command Grammar in
each case was designed to cover only the command subset of
the language; the Sentence Grammar was designed to cover all
of the language, which included both command and question
type constructs. The maximum perplexity measures of the
grammars, as proposed in [6], are shown in Figure 2. In both
tasks, the sentence grammars have a higher perplexity than
their command counterparts.

Figure 2: BYBLOS Recognition Results. Two task domains
(EMAIL and FCCBMP), two grammars for each task (Command
and Sentence), and varying amounts of training speech (2
minutes and 15 minutes). Also shown are maximum perplexity
measures for the grammars.

Adaptation  Time 

As described in Section 4, BYBLOS operates in two
modes, speaker-dependent and speaker-adapted. In
speaker-dependent mode, 15 minutes of training speech is required for
a speaker. This mode in general gives word accuracy in the
98.5% range. In the speaker-adapted mode, anywhere from 2
minutes down to 15 seconds of speech from a new speaker is
needed to "adapt" the HMM parameters to the new speaker.
The performance in this case is 97%.

Speaker  Type 

We have tested BYBLOS on several speakers with
different dialects, including a female speaker, a non-native
speaker, and 3 naive (uncoached) speakers. The recognition
results for these speakers showed little deviation from those of
typical male speakers of standard American dialects.

9. Summary

We have presented BYBLOS, a system for large
vocabulary continuous speech recognition. We showed how
we integrate multiple sources of knowledge to achieve high
recognition performance. In recognition experiments, we
demonstrated consistent performance across task domains,
grammars, adaptation time, and speaker type.

We are currently working to improve various aspects of
the system, including: a real-time implementation of the
recognizer, search strategy, acoustic modeling, and language
modeling. In the future, we plan to work on integration of
speech and natural language for speech understanding
applications.

ABSTRACT

A database of continuous read speech has been designed and
recorded within the DARPA Strategic Computing Speech
Recognition Program. The data is intended for use in designing and
evaluating algorithms for speaker-independent, speaker-adaptive,
and speaker-dependent speech recognition. The data consists
of read sentences appropriate to a naval resource management
task built around existing interactive database and graphics
programs. The 1000-word task vocabulary is intended to be
logically complete and habitable. The database, which represents
over 21,000 recorded utterances from 160 talkers with a variety of
dialects, includes a partition of sentences and talkers for training
and for testing purposes.

1  Introduction 

The development of robust, reliable speech recognition
systems depends on the availability of realistic, well-designed
databases. The technical and commercial community can benefit
greatly when different systems are evaluated with reference to
the same benchmark material. The DARPA 1000-word resource
management database was designed to provide such benchmark
materials. It consists of consistent but unconfounded training
and test materials that sample a realistic and habitable task
domain, and cover a broad range of speakers. The goal of this
database collection effort was to yield a set of data to promote
the development of useful large vocabulary, continuous speech
recognition algorithms. We hope that this description will serve
both to publicize the existence of the database and its availability
for use in benchmark tests, and to describe the methods used in
its construction.

The database includes materials appropriate to a naval
resource management task. The 1000 vocabulary items and 2800
resource management sentences are based on interviews with
naval personnel familiar with an existing test-bed database and
accompanying software to access and display information. 160
subjects, representing a wide variety of US dialects, read sentence
materials including 2 "dialect sentences" (i.e., sentences that
contained many known dialect markers), 10 "rapid adaptation
sentences" (designed to cover a variety of phonetic contexts), 2800
"resource management" sentences and 600 "spell mode" phrases
(words spoken and then spelled). The database is divided into
a speaker-independent part and a speaker-dependent part; both
are divided into training and test portions. The test portions
are further divided into equal sub-parts for initial testing during
system development ("development test"), and later evaluation
("evaluation test").


The methods build on and extend work by Leonard [3],
Fisher et al. [2] and Bernstein, Kahn and Poza [1]. Original
contributions of the current work include methods for designing
the vocabulary and sentence set, speaker selection, and
distribution of sentence material among the speakers.

The database design and implementation included
specification of a realistic and reasonable task domain, selection of a
habitable 1000-word vocabulary, construction of sentences to
represent the syntax, semantics, and phonology of the task, selection
of a dialectally diverse set of subjects, assignment of subjects to
sentences, recording of the subjects reading the sentences, and
implementation of a system for the distribution and use of the
database. These tasks are described in more detail below.

2  Task  Design 

2.1  Task  Domain  Specification 

We chose a database query task because it is a natural place
to use speech recognition technology as a human-machine
interface. To define realistic constraints, and allow for eventual
demonstrations of this technology, we based the task on the use
of an existing, unclassified test-bed database and an interactive
graphics program. The chosen task has the additional advantage
that it has been the basis of much research and development
in the natural language understanding community. The value of
speech recognition technology is enhanced by its integration with
a natural language understanding component.

The current phase of the DARPA speech recognition
program specifies a 1000-word vocabulary. The test-bed database,
however, has a substantially larger vocabulary size, and therefore
had to be restricted. Our philosophy in selecting a 1000-word
subset was to limit the number of database fields, rather than to
limit the ways a user might access the information. The fields
selected include information about various types of ships and
associated properties: locations, propulsion types, fuel, sizes, fleet
identifications, schedules, speeds, equipment availability and
status. The interactive graphics commands include various ways of
displaying maps and ship locations.

An initial set of 1200 resource management sentences came
from: (1) preliminary interviews with naval personnel familiar
with the test-bed database and the software for accessing it, and
(2) systematic coverage of the database fields, subject to review
by the naval personnel in follow-up interviews. These sentences
were intended to provide wide coverage of the syntactic and
semantic attributes of expected sentences, rather than expected
relative frequencies of such sentences. Sentences were not
filtered on the basis of "grammaticality", and therefore include,
for example, instances of deletion of "the", lack of number agreement
between subject and verb, and many cases of ellipsis (i.e., omission
of words required for strict grammaticality but not for
comprehension, as in the deletion of the second instance of "speed" in "Is
the Kirk's speed greater than the Ajax's speed").


2.2  Vocabulary 

The vocabulary was determined by collecting all words in the
1200 initial resource management sentences. If eventual users are
expected to stay within the defined vocabulary, it should be, in
some sense, grammatically, logically and semantically complete.
Therefore, words were added so that the vocabulary included (1)
both singular and plural forms of nouns, (2) words required for
all cardinal numbers less than a million, (3) words required for all
ordinals needed for dates, (4) infinitive, present and past
participle verb forms, (5) all months and days of the week. In addition,
items were added for semantic "completeness"; for example,
since high occurred, low, higher, highest, lower, and lowest were
added. The vocabulary was then completed by adding enough
open class items to cover 33 ports, 26 other land locations, 20
bodies of water, and 100 ship names (in both nominative and
possessive forms).
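
Requirement (2) is cheaper than it may sound: a small closed set of
number words covers every cardinal below a million. The sketch below
spells out cardinals with one plausible word list (29 words, without
"zero" or "and"); the actual number words in the database vocabulary are
not listed in this section, so the list is an assumption.

```python
ONES = ["one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
NUMBER_WORDS = set(ONES + TEENS + TENS + ["hundred", "thousand"])


def below_thousand(n):
    """Spell 0..999 as a list of words (empty list for 0)."""
    words = []
    if n >= 100:
        words += [ONES[n // 100 - 1], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10 - 2])
        n %= 10
    if n >= 10:
        words.append(TEENS[n - 10])
        n = 0
    if n > 0:
        words.append(ONES[n - 1])
    return words


def spell_cardinal(n):
    """Spell a cardinal in 1..999,999 using only NUMBER_WORDS."""
    if not 0 < n < 1_000_000:
        raise ValueError("n must be between 1 and 999,999")
    words = []
    if n >= 1000:
        words += below_thousand(n // 1000) + ["thousand"]
    words += below_thousand(n % 1000)
    return words
```

Every spelling is built only from `NUMBER_WORDS`, so adding those words
to the vocabulary makes the numeric part of the task closed.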

Since these sentences were to be read by naive subjects
not familiar with the task domain or the database, the
vocabulary was revised: some open class items were replaced with
others thought to be easier to pronounce (Sea of Japan for Sea
of Okhotsk), and spellings of some technical terms were changed
to clarify the pronunciation (TASSEM for the acronym TASM).

2.3  Sentence  Materials 

The 1200 initial resource management sentences had some
disadvantages: they included many slight variations of the same
sentence (e.g., only a ship name changed or "the" deleted), and the
vocabulary items were not evenly represented (the naval
personnel interviewed tended to use only one or two ship names, for
example, in all their examples). Further, we felt that far more
than 1200 sentences would be needed to represent the
vocabulary items and phonetic contexts of the task. Therefore, the
initial 1200 sentences were reduced to a set of 950 unique surface
semantic-syntactic patterns that were then used to generate 2800
sentences with excellent coverage of the vocabulary items.

The replacements included the replacement of instances of
specific ship names with the variable [shipname], and of many
instances of "the" with the variable [optthe] (to indicate optional
"the"). About 300 such variables (indicated here by square brackets
to distinguish them from vocabulary items) were defined and
used to replace specific instances.

In the two following examples, included to give an idea of the
degree of abstraction involved, the variable definitions are:
[what-is] => what is, what's; [shipname's] => Kirk's, Fox's, etc.;
[gross-average] => gross, average; [long-metric] => long, metric;
[show-list] => show, list, show me, etc.; [ships] => carriers,
cruisers, etc.; [water-place] => Indian Ocean, Sea of Japan, etc.;
[date] => March 4th, 2 June 1987, etc.

1. [what-is] [optthe] [shipname's] [gross-average] displacement

2. [show-list] [optthe] [ships] in [water-place] [date]


After replacement of instances with variables in the 1200
sentences, duplicates were removed, yielding 950 sentence
patterns. The patterns were ordered such that those with the most
unique words or classes appeared first in the list.

The 950 sentence patterns generated 2800 sentences in three
passes of substitution of an instance for each variable. A counter
associated with each variable determined which instance should
be used for each substitution. The patterns thus generated a set
of sentences that systematically covered the vocabulary items.
After removal of duplicates, there were 2835 sentences. The 35
longest sentences were removed; the remaining 2800 were hand
edited to remove infelicities that could arise from the procedure
(such as "one carriers" generated from [cardinal] [ships]). The first
600 sentences generated were designated training sentences; the
ordering of the patterns and the generation procedure resulted in
good coverage of the vocabulary: these 600 sentences cover 97%
of the vocabulary items.
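
The generation procedure can be sketched as template filling with one
counter per variable. The round-robin use of the counter, the regular
expression for bracketed variables, and the toy instance lists are
assumptions for illustration; the text specifies only that a counter per
variable chose the instance and that three passes were made.

```python
import re


def generate(patterns, instances, passes=3):
    """Fill each pattern once per pass, cycling through the instances
    of every variable with a per-variable counter, and drop duplicate
    sentences.

    patterns  : strings containing variables such as '[ships]'
    instances : dict variable name -> list of replacement strings
    """
    counters = {name: 0 for name in instances}

    def fill(match):
        name = match.group(1)
        options = instances[name]
        choice = options[counters[name] % len(options)]
        counters[name] += 1          # next use gets the next instance
        return choice

    sentences, seen = [], set()
    for _ in range(passes):
        for pattern in patterns:
            sentence = re.sub(r"\[([^\]]+)\]", fill, pattern)
            sentence = " ".join(sentence.split())  # tidy empty options
            if sentence not in seen:
                seen.add(sentence)
                sentences.append(sentence)
    return sentences


toy_instances = {
    "show-list": ["show", "list"],
    "optthe": ["the", ""],
    "ships": ["carriers", "cruisers", "frigates"],
}
generated = generate(["[show-list] [optthe] [ships]"], toy_instances)
```

Because each counter advances on every substitution, successive passes
over the same pattern draw different instances, which is what spreads
the vocabulary across the generated sentences.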

In between the concept of speaker-independence (requiring
no new data from new speakers) and speaker-dependence
(requiring a great deal of data from each new speaker) is the concept of
speaker adaptation (requiring a small amount of data from each
new speaker). For use in speaker-adaptation technologies we have
provided 10 "rapid adaptation" sentences, designed to provide a
broad and representative sample of the speaker's production of
phonemes and phoneme sequences of the 2800 resource
management sentences. The goal was to provide embedded sets of one,
two, five and ten sentences that each had the best coverage (for
its size) of the relevant phonemic material. Thus, the first is the
best adaptation sentence; the second sentence, when added to
the first, is the best combination of two sentences according to
the same coverage criteria, and so on up to ten.

A coverage score was calculated for each phoneme and
phoneme pair in a sentence based on the observed frequency of
the phoneme or phoneme pair in the 2800 sentences, but breadth
of coverage was promoted by dividing the observed frequency of
each phoneme or phoneme pair by a factor (we used 3.0) each
time it was used in the material currently having a score
calculated. In order to inhibit the tendency for the longest (and most
difficult to read) sentences to be selected, we normalized by
dividing the score by sentence length. The resulting adaptation
sentences are listed in the appendix.
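
The selection of embedded adaptation sets can be sketched as a greedy
procedure. Several details are assumptions here: sentence length is
taken as the number of phonemes, the discount is applied both within a
candidate sentence and across sentences already chosen, ties are broken
alphabetically, and the toy frequencies are invented.

```python
def units(phones):
    """Phonemes plus adjacent phoneme pairs of one sentence."""
    return list(phones) + list(zip(phones, phones[1:]))


def select_adaptation(sentences, freq, n_select, factor=3.0):
    """Greedily pick sentences with the best length-normalized coverage.

    sentences : dict name -> non-empty list of phonemes
    freq      : dict unit -> observed frequency in the full corpus
    factor    : divisor applied to a unit's value each time it is used
    """
    weight = dict(freq)          # current (discounted) value per unit
    remaining = dict(sentences)
    chosen = []

    def score(name):
        local, total = {}, 0.0
        for u in units(remaining[name]):
            value = local.get(u, weight.get(u, 0.0))
            total += value
            local[u] = value / factor     # repeats inside a sentence
        return total / len(remaining[name])

    while remaining and len(chosen) < n_select:
        best = max(sorted(remaining), key=score)
        for u in units(remaining.pop(best)):
            if u in weight:
                weight[u] /= factor       # discount covered material
        chosen.append(best)
    return chosen


toy_sentences = {"s1": ["a", "b"], "s2": ["a", "a"], "s3": ["c", "d"]}
toy_freq = {"a": 10, "b": 2, "c": 1, "d": 1,
            ("a", "b"): 2, ("a", "a"): 1, ("c", "d"): 1}
ranking = select_adaptation(toy_sentences, toy_freq, 3)
```

Since each choice depends only on what was chosen before it, the first k
sentences of a longer selection are exactly the selection of size k,
which gives the embedded sets described in the text.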

For the "spell-mode" utterances, 600 words were selected
from the 1000 vocabulary items; the 400 words not selected were
inflected variants of those chosen.

3  Subject  Selection  and  Recording 

3.1  Subject  Selection 

On the basis of demographic and phonetic characteristics,
160 subjects were selected from a set of 630 adults who had
participated in an earlier database effort [2]. These 630 native
speakers of English (70% male, 30% female) with no apparent
speech problems formed a relatively balanced geographic sample
of the United States. As a group, the subjects were young, well
educated and White (in their twenties, 78% with a bachelor's
degree and 1% Black). Each speaker was identified with one
of eight geographic regions of origin: New England, New York,
Northern, North Midland, South Midland, Southern, Western,
or Army brat (people who moved around a lot while growing
up).

Among other material, each of these 630 subjects had
recorded two dialect shibboleth sentences (i.e., sentences
containing several instances of words regarded as a criterion for
distinguishing members of dialect groups). These sentences, included
in the appendix, were hand transcribed and used to derive a
phonetic profile of each speaker as to phonology, voice quality, and
manner of speaking. The 630 speakers were automatically
divided into 20 clusters according to their pronunciation of
several consonants, speaking rate, F0, and phonation quality. From
these 630 speakers (now identified by phonetic cluster, geographic
origin and demographic characteristics), 160 were selected for the
speaker-independent part of the database, and 12 for the
speaker-dependent part.

The 160 speaker-independent subjects were chosen to
satisfy the following constraints, in order: 1) even distribution of
subjects over four geographic regions (NE-NY, Midland, South,
North-West or Army) and over the 20 clusters derived from
observed phonetic characteristics; 2) 70% male, 30% female. These
constraints are satisfied in the subject selection, and each major
division of the database (training, development test and
evaluation test) has a similar distribution across sex and geographic
origin.

The 12 speaker-dependent subjects were chosen to satisfy the
following constraints: 1) representation of each of the 12 largest
phonetic clusters; 2) seven male, five female; and 3) geographical
representation as follows: one each from New York and New
England, and two each from Northern, North Midland, South
Midland, Southern, and Western. Of the 12 selected speakers,
11 were from the speaker-independent part of the database, and
all were relatively fluent readers with no obvious speech problems.

3.2  Subject-Sentence  Assignment 

Both the speaker-independent and speaker-dependent parts
of the database are divided into sets for training, development
test and evaluation test.

In the speaker-independent training part of the database, 80
speakers each read 57 sentences (40 resource management
sentences, the 2 dialect sentences, and 15 spell-mode phrases). 1600
distinct resource management sentences were covered in this part
of the database; any given sentence was recorded by two subjects.
The distribution of sentences to speakers was arbitrary, except
that no sentence was read twice by the same subject. Each of
the 80 speakers read 15 spell-mode phrases, yielding 1200
productions covering 300 unique words. Each spell-mode phrase in
this part was read by 4 speakers.

In the speaker-independent development test set and
evaluation test set, 40 speakers each read 30 resource management
sentences, the 2 dialect sentences, the 10 rapid adaptation
sentences, and 15 spell-mode phrases. 600 resource management
sentences were randomly selected for each test and assigned to
the 1200 available productions (40 speakers times 30 sentences),
yielding two productions per sentence, as in the training phase.
Similarly, in each test set, 150 spell-mode phrases were selected
and assigned to the 600 available spell-mode productions.

The following table illustrates the structure of the
speaker-independent part of the database. The numbers indicate how
many sentences each subject read; the total number of resource
management sentences covered by each subset of the database
is indicated in parentheses. These are referred to as "types" in
the table, in distinction to sentence tokens, or productions by a
particular speaker. In all, for the speaker-independent database,
9120 sentences were recorded (4560 for training, 2280 for
development test, and 2280 for evaluation test). Note that, this being the
speaker-independent database portion, the training subjects do
not overlap with those in the test parts of the database.

For the speaker-dependent training portion of the database,
each of 12 subjects read the 600 resource management
training sentences, the 2 dialect sentences, the 10 rapid adaptation
sentences, and a selection of 100 spell-mode phrases. The 1200
spell-mode readings covered 300 word types, with 4 productions
per word.

In the speaker-dependent test portion of the database, these
same 12 speakers each read 100 resource management sentences
for the development-test part of the database and another 100
resource management sentences for the evaluation-test part, as
well as 50 spell-mode phrases. From the 2200 resource
management sentences not read in the training phase, two random
selections of 600 sentences were made, one for the development
test and one for the evaluation test portion. Distributing these
over the productions available in each gives 2 utterances per
sentence. Similarly, two random selections of 150 words each were
made from the pool of 600 spell-mode phrases for the
development and evaluation test sets. Distributing these over the 600
readings available yields 4 productions per word.
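
The production counts quoted in this section follow from simple multiplication, so they can be cross-checked with a short script. The sketch below is only a bookkeeping aid: the function and variable names are illustrative and not part of the database documentation, and the per-speaker figures are those given in the text.

```python
# Cross-check of the production counts given in Section 3.2.
# All names here are illustrative; the figures are taken from the text.

def per_speaker(rm, dialect, adapt, spell):
    """Number of items one speaker reads in a given subset."""
    return rm + dialect + adapt + spell

# Speaker-independent part: 80 training speakers, 40 speakers per test set.
si_train = 80 * per_speaker(rm=40, dialect=2, adapt=0, spell=15)
si_test = 40 * per_speaker(rm=30, dialect=2, adapt=10, spell=15)  # per test set
si_total = si_train + 2 * si_test

# Speaker-dependent part: the same 12 speakers in every subset.
sd_train = 12 * per_speaker(rm=600, dialect=2, adapt=10, spell=100)
sd_test = 12 * per_speaker(rm=100, dialect=0, adapt=0, spell=50)  # per test set
sd_total = sd_train + 2 * sd_test

# Productions per sentence (or spell-mode word) type.
si_train_per_sentence = (80 * 40) // 1600
si_test_per_sentence = (40 * 30) // 600
sd_test_per_sentence = (12 * 100) // 600
sd_test_per_spell_word = (12 * 50) // 150

print("speaker-independent:", si_train, si_test, si_total)
print("speaker-dependent:  ", sd_train, sd_test, sd_total)
```

Running the script reproduces the totals stated for the two parts of the database and the number of productions per sentence type.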

The following table illustrates the structure of the
speaker-dependent part of the database. Again, the total number of
different resource management sentences ("types") covered in
each subset is indicated in parentheses after the number
indicating how many sentences were read by each subject. In all, for
the speaker-dependent database, 12,144 utterances were recorded
(8544 for training, 1800 for development test, and 1800 for
evaluation test). As is appropriate for a speaker-dependent database,
the speakers in the training set are the same as the speakers in
the test set.

3.3 Recording Procedure

The utterances were digitally recorded in a sound-isolated
recording booth on two tracks: one from a Sennheiser HMD414
headset noise-cancelling microphone, and the other from a B&K
4165 one-half inch pressure microphone positioned 30 cm from
the subject's lips, off-center at a 20 degree angle. The material
was digitized at 20,000 16-bit samples per second per channel,
and then down-sampled to 16 kHz.

Prompts appeared in double-high letters on a screen for the
subject to read. After the recording, both the subject and the
director of the recording session listened to the utterances and
re-recorded those with detected errors. Any pronunciation
considered normal by the subject was accepted.

4  Database  Availability  and  Use 

This database, which is intended for use in designing
and evaluating algorithms for speech recognition, is being made
available to provide (1) a carefully structured research resource,
and (2) benchmarks for performance evaluation to judge both
incremental progress and relative performance.

At present only the data from the Sennheiser microphone
is available. This material alone amounts to approximately 930
Megabytes (MB) of data for the speaker-dependent subset and
640 MB for the speaker-independent subset, with an additional
460 MB included in the spell-mode subset. The down-sampled
(16 kHz) data in Unix "tar" format (6250 bpi) can be made
available on a loan, copy, and return basis.

To provide benchmark test facilities, a set of procedures and
a uniform scoring software package have been developed at the
National Bureau of Standards (NBS). The scoring software
implements a dynamic programming string alignment on the
orthographic representations for the reference sentences and for the
system outputs. Comparable scoring necessitated agreement on
a standard orthographic representation for each vocabulary item.
The scoring software and testing procedure are being used in the
DARPA program for performance evaluation, and are available
to the general public on request.
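
The kind of dynamic programming string alignment described above can be sketched in a few lines of Python. This is not the NBS scoring package: it is a minimal illustration that assumes unit cost for every substitution, deletion, and insertion, and the function names `align_counts` and `word_error_rate` are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Count (substitutions, deletions, insertions) along one minimum-cost
    alignment of two word strings. Assumes unit cost for every error type."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (total errors, S, D, I) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # only deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)            # match
            else:
                diagonal = (t + 1, s + 1, d, n)    # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total errors first, so the minimum is kept.
            cost[i][j] = min(diagonal, deletion, insertion)
    _, s, d, n = cost[-1][-1]
    return s, d, n


def word_error_rate(reference, hypothesis):
    """(S + D + I) / N, where N is the number of reference words."""
    s, d, n = align_counts(reference, hypothesis)
    return (s + d + n) / len(reference.split())


print(align_counts("how many ships were in galveston",
                   "how many ships in galveston"))
```

When several alignments share the same minimum cost, the split between error types is not unique; a real scoring package has to fix a convention for such ties, which this sketch does not attempt to reproduce.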

For those organizations wishing to determine and report
performance data corresponding to that reported by DARPA
program participants, NBS can provide test material used in DARPA
benchmark tests [4]. If the results are to be publicly reported,
it is required that the summary statistics be obtained using the
NBS scoring software, and that copies of system output for these
tests be made available to NBS.

5  Conclusion 

For DARPA program participants, this database has proven
useful in the design and evaluation of speaker-independent,
speaker-adaptive, and speaker-dependent speech recognition
technologies; we hope it will be useful to others as well.
Similarly, the methods developed for its design and collection should
prove useful in the development of similar databases.

We have described the characteristics of the DARPA
1000-word resource management database: the task domain, the
vocabulary, the sentence materials, the subjects, and the division into
training and testing portions. We have also described the steps
involved in creating this database, including the recording
procedure and new methods for designing the vocabulary and
sentence set, speaker selection, and distribution of sentence
materials among the speakers. In addition, we have outlined procedures
for obtaining the database and for using it as a benchmark.
Further details on each of these areas will be made available with
the database.

APPENDIX

Dialect-Shibboleth  Sentences 

1 She had your dark suit in greasy wash water all year.

2 Don't ask me to carry an oily rag like that.

Rapid  Adaptation  Sentences 

1 Show locations and C-ratings for all deployed subs that were
in their home ports April 5.

2 List the cruisers in Persian Sea that have casualty reports
earlier than Jarrett's oldest one.

3 Display posits for the hooked track with chart switches set to
their default values.

4 What is England's estimated time of arrival at Townsville?

5 How many ships were in Galveston May 3rd?

6 Draw a chart centered around Fox using stereographic
projection.

7 How many long tons is the average displacement of ships in
Bering Strait?

8 What vessel wasn't downgraded on training readiness during
July?

9 Show the same display increasing letter size to the maximum
value.

10 Is Puffer's remaining fuel sufficient to arrive in port at the
present speed?

ABSTRACT

Statistical language models have been successfully used to
improve the performance of continuous speech recognition
algorithms. Application of such techniques is difficult when
only a small training corpus is available. This paper
presents an approach for dealing with the limited training
available from the DARPA resource management domain. An
initial training corpus of sentences was abstracted by replacing
sentence fragments or phrases with variables. This training
corpus of phrase sequences was used to derive the parameters
of a Markov model. The probability of a word sequence is
then decomposed into the probability of possible phrase
sequences and the probabilities of the word sequences within
each of the phrases.

Initial results obtained on 150 utterances from six
speakers in the DARPA database indicate that this language
modeling technique has potential for improved recognition
performance. Furthermore, this approach provides a
framework for incorporating linguistic knowledge into statistical
language models.


1  Introduction 

This paper addresses the use of statistical language
modeling techniques in continuous speech recognition in the
DARPA 1000-word naval resource management application
domain [5]. This application involves the recognition of
"natural" speech queries to an interactive database system.
As will be discussed below, the "language" which will be
used is unknown and a large training corpus is not available.
Straightforward application of statistical language
modeling techniques is therefore difficult. However, a language
model is required to obtain very good recognition
performance.

Language models provide a way of assigning likelihoods
to word sequences in a language. The combination of such a
measure with a measure of the acoustic likelihood of a word
sequence has been shown to give good recognition
performance in many applications. Several approaches have been
successfully employed for languages of various complexity
and various sizes of training corpus (for example [2]).

1 This research was supported by the Defense Advanced
Research Projects Agency under contract N00039-85-C-0423,
monitored by SPAWAR.

In certain restricted domains, finite state grammars have
been used with considerable success (see [4] for example).
In this case, the likelihood of a word sequence is a binary
decision: a sequence is either parsed in the grammar or it
is not in the allowable language. The extent to which the
actual word sequences in the application are parsed by the
grammar is termed coverage. When the language is known
and not complex, the coverage is generally high and the
constraints are well modeled by the grammar.

In the case of large vocabularies (> 1000 words) and
"natural" language input, one approach taken is the specification
of formal grammars which describe the syntactic and
semantic constraints of the domain [6]. The important issue
is then the extent to which this grammar provides
sufficient coverage while ruling out invalid word sequences. It
has been found, however, that it is difficult to achieve a high
degree of coverage. Recognition performance is generally
high on sequences parsed by the grammar. However, when
coverage of the valid word sequences is not high, the
language model actually introduces errors by not allowing
valid word sequences.

To overcome the performance constraints imposed by
poor coverage, statistical language models can be used.
When a large training corpus is available, the parameters
of a statistical language model can be determined. To the
extent that the training corpus is representative of the real
application, such techniques provide good performance [1].
Furthermore, since no binary decision as to the validity of
a word sequence is necessary, the method is less "brittle"
than the formal grammar techniques.

In the domain of interest in this paper, the language is
not sufficiently well defined to allow the use of a
finite-state grammar which both captures the constraints of the
domain and is of reasonable size. Furthermore, there is no
adequate training corpus for construction of a
straightforward statistical model to characterize the word sequences.
Due to the natural language interface, a grammar
describing the complete language is very complex. Also, it is
difficult to evaluate the extent to which any particular grammar
covers sentences of the ultimate application domain. The
complexity of the language suggests a statistical approach.
However, since the application does not yet exist, a truly
representative training corpus is not available.
Furthermore, we feel that due to heavy use of jargon and unusual
sentence structure, any attempt to use a training corpus
from another domain, such as general English text, would
be ineffective.

Figure 1: Word sequence model

The approach described in this paper attempts to
incorporate some linguistic knowledge of the structure of the
language into a probabilistic framework. Using this approach,
we will show that very good performance can be obtained when
the algorithm is evaluated on sentences which are
independent of those used in construction of the statistical model.

In the next section, the basic structure of the model is
described, followed by a description of the training method
employed. In Section 3, the results on six speakers from the
DARPA database are presented. Finally, Section 4 contains
a short discussion and concluding remarks.

2  Approach 

2.1  Language  Model  Structure 

The principal goal in the design of the probabilistic
language model is to allow the estimation of robust model
parameters from the modest training corpus which is
available. A Markov model used to generate word sequences
directly has too many parameters (the transition
probabilities) to be estimated reliably from the limited training
corpus. By considering a simpler model, which has fewer
parameters associated with it, robust estimates might be
obtainable. Furthermore, some linguistic structure can be
identified, and this structure is incorporated into the model.

The model for the generation of a word sequence is
composed of two parts (Figure 1). First, a sequence of phrase
variables $c_1, c_2, \ldots$ is generated as a Markov chain. Then,
for each phrase $c_i$, a sequence of words $w^{(i)}$ is generated,
independent of the phrases $c_j$, $j \neq i$. The probability of a
phrase sequence $c_1, c_2, \ldots, c_N$ is

$$\Pr(c_1, \ldots, c_N) = \Pr(c_1)\,\Pr(c_2 \mid c_1) \cdots \Pr(c_N \mid c_1, \ldots, c_{N-1})$$

The probability of the phrase sequence and the word
sequence $w_1, w_2, \ldots, w_n$ is then

$$\Pr(c_1, \ldots, c_N, w_1, \ldots, w_n) = \sum_{w^{(1)}, \ldots, w^{(N)}} \prod_{i=1}^{N} \Pr(w^{(i)} \mid c_i)\,\Pr(c_i \mid c_1, \ldots, c_{i-1})$$

where the sum is effectively over the possible segmentations
of the word sequence into the phrases. Note that since any
$w^{(i)}$ might be a null expansion of a phrase, this
representation of the probability in fact has an infinite number of
terms.
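
For a fixed phrase sequence, the sum over segmentations can be accumulated phrase by phrase rather than enumerated. The following Python sketch illustrates this under stated assumptions: it uses a first-order Markov chain over phrases (a simplification of the general history-dependent form above), the function name `joint_probability` is hypothetical, and the toy phrase inventory and probabilities are invented purely for illustration.

```python
def joint_probability(phrases, words, transition, expansions, start="<s>"):
    """Pr(c_1..c_N, w_1..w_n), summed over all segmentations of `words`
    into expansions of the given phrases.

    transition: dict mapping (previous phrase, phrase) -> probability
                (first-order chain, an illustrative simplification).
    expansions: dict mapping phrase -> {tuple of words: probability};
                the empty tuple () stands for a null expansion.
    """
    # Probability of the phrase sequence itself.
    chain = 1.0
    previous = start
    for c in phrases:
        chain *= transition.get((previous, c), 0.0)
        previous = c

    # alpha[k] = probability that the phrases processed so far
    # generate exactly the first k words.
    n = len(words)
    alpha = [1.0] + [0.0] * n
    for c in phrases:
        new_alpha = [0.0] * (n + 1)
        for k in range(n + 1):
            if alpha[k] == 0.0:
                continue
            for expansion, p in expansions[c].items():
                end = k + len(expansion)
                if end <= n and tuple(words[k:end]) == expansion:
                    new_alpha[end] += alpha[k] * p
        alpha = new_alpha
    return chain * alpha[n]


# Toy model; every phrase name and number below is made up for illustration.
expansions = {
    "show": {("show",): 1.0},
    "[ship]": {("england",): 0.5, ("fox",): 0.5},
    "[opt-date]": {(): 0.5, ("on", "april", "5"): 0.5},
}
transition = {
    ("<s>", "show"): 1.0,
    ("show", "[ship]"): 0.8,
    ("[ship]", "[opt-date]"): 1.0,
}
phrases = ["show", "[ship]", "[opt-date]"]
p_with_date = joint_probability(phrases, "show fox on april 5".split(),
                                transition, expansions)
p_without_date = joint_probability(phrases, "show fox".split(),
                                   transition, expansions)
print(p_with_date, p_without_date)
```

The same accumulation extends to a search over phrase sequences, which is where the null expansions mentioned above make the number of candidate terms unbounded.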

Using this structure, we identify phrases based on
syntactic and semantic components of the language. For
example, typical phrases include "open" set classes such as ship
names or complex expressions such as dates. Also, to
complete the coverage of the language, single word phrases are
allowed. Associated with each phrase is a small finite
state grammar describing all possible ways that a phrase
can be expanded.

The parameters of the Markov phrase model are derived
from the training corpus. The probabilities $\Pr(w^{(i)} \mid c_i)$
associated with the transformation of phrases into word
subsequences are assigned a priori. In this way, a small
training corpus can be used to estimate the smaller number
of parameters of the Markov model without sacrificing the
robustness of the overall model.

2.2  Corpus 

In the resource management application domain, the initial
training corpus consists of approximately 1200 sentences on
a vocabulary of about 1000 words which are thought to be
representative of the domain. These sentences were
generated attempting to simulate the interaction of a person with
the interactive database system. This database is further
described in [5] in these proceedings.

From these initial sentences, a set of approximately 1000
sentence patterns was generated. This process was
carried out manually. The goal was to incorporate linguistic
knowledge by replacing syntactically and semantically
similar components of the sentences with phrase identifiers. For
example, a typical sentence and its corresponding pattern
is

What gas surface ships which are in Coral Sea are
SLQ-32 capable

=> what [prop-type] surface [vessels] [optthat-are]
in [water-place] are [capability] capable

A phrase such as [optthat-are] can be expanded into the
finite state grammar

[optthat-are] -> (empty string)
              -> which are
              -> that are
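
The relationship between a pattern and the sentences it stands for can be made concrete with a small expansion routine. This is a sketch only: the two-entry grammar table is a hypothetical, heavily abbreviated stand-in for the real phrase grammars, and each phrase is assumed to expand into a flat list of alternatives rather than a general finite state network.

```python
from itertools import product

# Hypothetical, abbreviated phrase grammars: each phrase maps to a list of
# alternative word sequences; the empty list is the null expansion.
GRAMMARS = {
    "[optthat-are]": [[], ["which", "are"], ["that", "are"]],
    "[water-place]": [["Coral", "Sea"], ["Bering", "Strait"]],
}


def expand(pattern, grammars):
    """Yield every word sequence obtainable from a pattern.
    Tokens without a grammar entry are treated as single-word phrases."""
    choices = [grammars.get(token, [[token]]) for token in pattern]
    for combination in product(*choices):
        yield [word for part in combination for word in part]


pattern = ["ships", "[optthat-are]", "in", "[water-place]"]
sentences = [" ".join(words) for words in expand(pattern, GRAMMARS)]
for sentence in sentences:
    print(sentence)
```

With three alternatives for the optional phrase and two for the place phrase, this one pattern stands for six distinct word sequences, which is how a modest number of patterns can cover a much larger set of sentences.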

For  each  experiment,  these  patterns  were  partitioned  into 
a  training  and  testing  set.  The  testing  set  was  not  used  in 
the  estimation  of  the  model  parameters.  The  test  sentences 
were  generated  from  the  test  patterns  by  expanding  the 
phrases  into  word  sequences. 

2.3  Parameter  Estimation 

For each speaker, a set of 900 training patterns was
chosen which was disjoint from the patterns of the test sentences.
A first order Markov model was constructed based on the
training patterns (the patterns included the context of the
sentence initial and sentence final boundary markers). The
transition probabilities were obtained from the relative
frequencies of phrase pairs in the training patterns, using
a simple interpolation rule to incorporate part of the
zeroth order distribution. Interpolation is used to overcome
limitations of insufficient training by assigning reasonable
nonzero probabilities to all events. Specifically, if $F(c_i \mid c_{i-1})$
is the relative frequency of $c_i$ following $c_{i-1}$ and $F(c_i)$ is the
relative frequency of $c_i$, then the probability of a phrase $c_i$ is
assumed to be

$$\Pr(c_i \mid c_{i-1}) = \lambda F(c_i \mid c_{i-1}) + (1 - \lambda) F(c_i)$$

where in these experiments $\lambda = 0.9$ for all states. For
each grammar associated with a phrase, a simple
assumption that all possible word sequences are equally likely was
made. Specifically, if there are $m$ different non-null
expansions of a phrase $c$, then each of these expansions $w_1, \ldots, w_k$
is assigned a probability

$$\Pr(w_1, \ldots, w_k \mid c) = (1 - \theta_c)\,\frac{1}{m}$$

where $\theta_c$ is the probability of a null expansion. For
non-optional phrases, $\theta_c = 0$.
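
The two estimation rules above translate directly into code. The sketch below is an illustration, not the authors' implementation: the function names are hypothetical, the boundary markers `<s>` and `</s>` are assumed symbols, and the choice to compute the zeroth order frequencies over every predicted symbol (including the sentence-final marker) is an assumption made so that the interpolated probabilities sum to one for any context seen in training.

```python
from collections import Counter


def estimate(patterns, lam=0.9):
    """Build pr(cur, prev) = lam * F(cur | prev) + (1 - lam) * F(cur)
    from a list of phrase patterns (each a list of phrase symbols)."""
    unigram, bigram, context = Counter(), Counter(), Counter()
    for pattern in patterns:
        sequence = ["<s>"] + list(pattern) + ["</s>"]  # boundary markers
        for prev, cur in zip(sequence, sequence[1:]):
            bigram[(prev, cur)] += 1
            context[prev] += 1
            unigram[cur] += 1
    total = sum(unigram.values())

    def pr(cur, prev):
        # First order relative frequency; zero if the context was never seen.
        f_pair = bigram[(prev, cur)] / context[prev] if context[prev] else 0.0
        # Zeroth order relative frequency.
        f_single = unigram[cur] / total
        return lam * f_pair + (1.0 - lam) * f_single

    return pr


def expansion_probability(m, theta=0.0):
    """Probability given to each of the m non-null expansions of a phrase,
    where theta is the probability of the null expansion."""
    return (1.0 - theta) / m


# Tiny illustrative training set (not from the actual corpus).
pr = estimate([["what", "[ship]"], ["show", "[ship]"]])
print(pr("[ship]", "what"))   # seen pair: dominated by the first order term
print(pr("show", "what"))     # unseen pair: small but nonzero
```

The second printed value shows the purpose of the interpolation: a phrase pair never observed in training still receives a nonzero probability through the zeroth order term.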

2.4  Decoding  Method 

The decoding algorithm used to generate the results is
based on the algorithm presented in [2,3]. A hidden Markov
model approach is taken in which context-dependent
triphone models are trained using the "forward-backward"
algorithm. Whole word models are constructed by
concatenation of interpolated (by context) triphone models.

The statistical language model described above is
combined with these word models. Conceptually, each
transition in the Markov phrase model is replaced by a
network representation of the sub-grammar associated with
the phrase (with branching probabilities at each of the
nodes). Each arc in the grammar is replaced by the hidden
Markov model for the word associated with the arc.
Therefore, the entire model can be thought of as one large hidden
Markov model.

The decoding algorithm attempts to find the maximum
likelihood phrase sequence $c_1, \ldots, c_N$ and the word
expansions $w^{(i)}$ of each phrase. The output word sequence is then
the concatenation of the $w^{(i)}$.

3  Results 

Initial  experiments  were  conducted  on  a  speaker  not  in¬ 
cluded  in  the  DARPA  database  in  order  to  determine  suit¬ 
able  system  parameters  (which  were  then  unchanged). 

3.1  Test  on  Training 

Before evaluation on the independent test sets, two
speakers were run using sentences derived from patterns in their
training sets. As expected, the perplexity Q (footnote 2) for the
statistical model is very low in this case and the recognition word
error rate (footnote 3) is small. As shown in Table 1, this
demonstrates that when evaluated on the training set such a
statistical model gives low perplexity and good recognition
performance. For comparison, results using a grammar (WP) are
shown. This grammar is constructed to allow all two-word
sequences which occur in any expansion of the training
patterns. Note that even though the statistical model used
incorporates the interpolation rule described above, and
therefore allows all possible word sequences and not
simply those in the WP grammar, the perplexity is lower
than that of the WP grammar and the performance is somewhat
better.

             test on training
speaker      MP      (Q)       WP            (Q)
dtb          5.4%    (42.5)    [illegible]   (69.8)
dvy          4.5%    (40.3)    5.9%          (53.3)

Table 1: Word error rate on training set (MP = Markov
phrase model; WP = word pair grammar)

2 Perplexity $Q = 2^I$, where $I$ is the average information
($-\log_2 p$) of the state transitions (with probabilities $p$) in a set
of sentences using a particular probabilistic model.

3 Word error rate is the average number of substitution (S),
deletion (D), and insertion (I) errors per reference word
($= (S + D + I)/N$, where $N$ is the number of reference words).
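
The perplexity definition in footnote 2 can be made concrete with a few lines of Python. This is a minimal sketch; the function name `perplexity` and the example probability lists are illustrative, and the transition probabilities are assumed to have been collected already from a set of sentences under some model.

```python
import math


def perplexity(transition_probs):
    """Q = 2**I, where I is the average of -log2(p) over the transitions."""
    information = [-math.log2(p) for p in transition_probs]
    return 2 ** (sum(information) / len(information))


# A model that always chooses uniformly among 1000 words (a null grammar
# over a 1000-word vocabulary) has perplexity 1000.
print(perplexity([1.0 / 1000.0] * 5))

# Mixed probabilities give the geometric-mean branching factor.
print(perplexity([0.5, 0.125]))
```

Perplexity can therefore be read as an average branching factor: a uniform choice among k alternatives at every step yields Q = k.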

3.2  Test  Results 

The full evaluation consisted of six speakers from the
DARPA database with 25 utterances each. The word
error rates are presented in Table 2.

             independent test             test on training
speaker      MP           NG              WP
             (Q ≈ 75)     (Q = 1000)      (Q ≈ 60)
bef          12.3%        40.9%           8.9%
cmr          13.8%        39.6%           9.3%
dtb          11.8%        39.1%           5.1%
dtd          10.0%        26.7%           6.7%
pgh           7.0%        32.0%           6.0%
tab           6.3%        21.8%           3.2%
ave.         10.2%        33.9%           6.6%

Table 2: Recognition word error rate (MP = Markov phrase
model; NG = null grammar; WP = word pair grammar)

In order to evaluate
the improvement due to the statistical language model, the
word error rate for a "null" grammar (NG) in which all
word sequences can occur is also shown. The NG result
is a measure of the acoustic difficulty of the task. The
result using the word-pair (WP) grammar, trained on the
training and testing patterns, is also presented in order to
show that the statistical approach achieved almost equal
performance without the loss imposed by imperfect
coverage. Also, note that the perplexity of the statistical model
(Q = 75) is comparable to that of the WP grammar (Q ≈ 60)
(footnote 4), despite the fact that the WP perplexity is measured on a
subset of its training sentences. Finally, in order to evaluate
the effect of coverage of a grammar on overall performance,
consider the sentence error rates of 49% for the statistical
MP case and 36% for the WP grammar. In order for the
WP grammar to achieve a 49% sentence error rate including
the effect of imperfect coverage, at least 80% of the sentences
would have to parse (footnote 5). Currently, this level of coverage is
not available.

The results presented are preliminary. Several aspects of
this approach have not been investigated. For instance, the
structure of the Markov model has not been fully explored.
Though some experiments have been performed to evaluate
the use of certain higher order states which have been
observed in the training, it is not clear how the model should
be constructed to actually improve recognition performance
significantly. Also, the assumption that all word sequences
within a grammar are equally likely is clearly a very crude
approximation and some improvement may be obtainable
through more careful assignment of these probabilities.

4 Perplexity on the WP grammar is obtained assuming all
branches in a deterministic finite state network representation
are equally likely.

5 Suppose a fraction $r$ of sentences parse under the WP
grammar with a sentence error rate of 36%, and that the remainder
are all in error. Then the overall error rate would be
$(1 - r) + 0.36\,r$. For this to be less than 49%, $r > 80\%$.

4  Conclusions 

The results presented here demonstrate the viability of
incorporating linguistic structure into a statistical model. In
the resource management domain, neither statistical
nor linguistic techniques alone are adequate at this time.
Straightforward statistical techniques lack sufficient
training and linguistic techniques have inadequate coverage.
However, the combination of the modest training available
and simple linguistic abstractions of this training corpus
provides good performance.

Abstract

We present results of the BBN BYBLOS continuous speech
recognition system tested on the DARPA 1000-word
resource management database. The system was trained in a
speaker dependent mode on 28 minutes of speech from each
of 8 speakers, and was tested on independent test material
for each speaker. The system was tested with three artificial
grammars spanning a broad perplexity range. The average
performance of the system measured in percent word error
was: 1.4% for a pattern grammar of perplexity 9, 7.5% for
a word-pair grammar of perplexity 62, and 32.4% for a null
grammar of perplexity 1000.

1  Introduction 

A  meaningful  comparison  between  the  performance  of 
speech  recognition  algorithms  and  systems  can  be  made 
only  if  the  systems  have  been  tested  on  a  common  database. 
Even  with  common  testing  material,  comparative  results 
become  difficult  to  interpret  when  grammars  are  used  to 
constrain  the  recognition  search.  The  ambiguity  introduced 
by  the  use  of  grammars  can  be  overcome  by  reporting  re¬ 
sults  with  the  grammar  disabled,  which  would  establish  a 
baseline  acoustic  recognition  performance  for  the  system, 
and by using standard, generally available grammars.
Finally, reporting a standard measure of the constraint
provided by a grammar makes the results more meaningful.

In this paper we report results for the BBN BYBLOS
system tested on a standard database using two
well-defined, artificial grammars and with an unconstrained null
grammar. The database has been developed by the DARPA
Strategic Computing Speech Recognition Program for the
purpose of comparative system performance evaluation of
continuous speech recognition systems [6].

In section 2, we describe the BYBLOS system. In section
3, the database and testing protocol are discussed. The
grammars used in the experiments are described in section
4. Section 5 presents the recognition system results. The
results are discussed in section 6.

2  The  BYBLOS  System 

The BYBLOS continuous speech recognition system [2]
uses discrete density hidden Markov models (HMM) of
phonemes, a phonetic dictionary, and a finite state
grammar to achieve high recognition performance for language
models of intermediate complexity. The parameters of the
HMMs are estimated automatically from a set of
supervised training data. The trained phoneme models are
combined into models for each word in the dictionary. These
phonetic word models are then used to compute the most
likely sequence of words in an unknown utterance. A
formal description of a complete HMM system is presented in
[1].

The BYBLOS system has been designed to accommodate
large vocabulary applications. It trains a set of phoneme
models which requires only a moderate amount of speech
to adequately observe all the phonemes. In addition, the
system trains a separate model for each distinct context in
which a phoneme is observed. A phoneme's context can
be defined by its adjacent phonemes or the word in which
it appears. Context modeling captures coarticulation
phenomena explicitly and preserves phonetic detail for those
contexts which occur frequently in the training material [7].
By combining the smoothed phoneme models with the
detailed context models, BYBLOS makes maximal use of the
available training material. The performance improvement
gained by using context dependent phoneme modeling has
been reported in [3].

After training is completed, the dictionary is populated by
compiling the trained phonetic models into word networks. A finite
state grammar, if used, is compiled from a formal language model
specification. To decode an unknown utterance, BYBLOS utilizes the
precompiled knowledge sources jointly in a time-synchronous,
top-down search. This search strategy allows efficient pruning and
minimizes local decisions.

BYBLOS has been demonstrated in a speaker dependent and a speaker
adaptive mode. Speaker dependent modeling achieves high performance
by estimating the model parameters from a training set which is
large enough to contain most of the contexts likely to appear in
subsequent use of the system. The speaker dependent mode has been
used to achieve the results reported in this paper. The speaker
adaptive mode modifies the well trained, speaker dependent word
models of one speaker to model a new speaker. This technique allows
the system to benefit from the well trained word models of a
prototype speaker even when the training material from the new
speaker is extremely limited. The adaptation mode of the BYBLOS
system is discussed in [4, 5].

3  Database 

The database, described in detail in [6], was designed to provide a
standard for research in speaker dependent, speaker adaptive, and
speaker independent continuous speech recognition. The database was
designed to cover the vocabulary, syntax, and functionality of a
naval resource management task. The vocabulary consists of 1000
words. The task domain covered by the database is specified by a
set of 950 sentence patterns which were used to generate the 2800
distinct sentences in the database.

The speaker dependent database provides 600 sentences (about thirty
minutes of speech) designated as training material from each of
twelve dialectally diverse speakers, collected in six different
sessions. The scripts for the training material are designed to
maximize coverage of the vocabulary and sentence patterns. The
speakers include seven male and five female speakers. Independent
test material was collected for the twelve speakers during
additional sessions.

The experiments reported in this paper have been conducted for the
purpose of comparative performance evaluation within the DARPA
community. The evaluation was administered by the National Bureau
of Standards (NBS). For the speaker dependent portion of the
evaluation, tests were conducted using eight of the twelve
available speakers.

We withheld 30 sentences from the training material for each
speaker to be used for adjusting global system parameters. The
remaining 570 sentences that we used for training include 952
unique words from the vocabulary. Approximately 5% of the words in
the dictionary are not observed at all in the training set, 36%
occur only once, and 49% occur more than once.

Twenty five sentences were selected by NBS as test material for
each speaker. The test sets are different for each speaker, but on
average, each set contains about 200 words. The test sentences for
the eight speakers cover 46% of the dictionary. 91% of the word
tokens occurring in the eight test sets have occurred more than
once in the training set, illustrating the effectiveness of the
training data coverage over the task domain.

4  Grammars 

The results reported below have been run using three different
grammar conditions. These grammars are not intended as serious
models of the task domain, but are used because they are simply
defined and allow the system to be tested over a broad range of
language model constraint.

A straightforward measure of the constraint provided by a grammar
is test set perplexity [5], which is measured on a finite state
network generated by the grammar and a given set of test sentences.
For the purpose of perplexity measurement, a distinguished symbol
designating inter-sentence silence is added to the dictionary and
to the end of each sentence of the test set. The augmented
sentences are then concatenated and appended to an initial
inter-sentence silence to form the word sequence
$w_1, w_2, \ldots, w_n$. If the word sequence is sufficiently long,
the probability of the sequence given the grammar,
$P(w_1, w_2, \ldots, w_n)$, can be used to compute an estimate of
the grammar perplexity.

The perplexity of the grammar, given the test set word sequence, is
defined as:

$$L = 2^K \qquad (1)$$

where

$$K = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_{i-1}, \ldots, w_1) \qquad (2)$$

is the average per word entropy of the language model, and

$$P(w_1) = 1 \qquad (3)$$

For the grammars used in these experiments, the probabilities on
the words allowed by the grammar at position $i$ in the test set
word sequence are assumed to be uniform.

The three grammars, which we call the sentence pattern, word-pair,
and null grammar, allow all sentences in the training and test
databases. The sentence pattern grammar is compiled directly from
the set of 950 sentence patterns covering all sentence types in the
task domain [6]. The perplexity of the pattern grammar, averaged
over the eight speakers' test sets, is 9. The word-pair grammar
allows all two-word sequences allowed in the sentence pattern
grammar. Its perplexity is about 62. The null grammar allows all
sequences of words in the vocabulary and therefore offers no
language model constraint. The effective perplexity of the null
grammar is equal to 1000, the vocabulary size.

5  Results

The system parameters for these experiments were derived from two
speakers' data collected at BBN and limited testing on two speakers
from the DARPA database (CMR and BEF) using the data that we
withheld from the training set. The system configuration was then
fixed for the entire set of experiments. Each speaker was trained
only once.

The database speech was collected at Texas Instruments (TI) in a
sound isolating booth. For these experiments we used speech sampled
at 20 kHz through a Sennheiser HMD-414, close-talking,
noise-canceling microphone. 14 Mel-scale-warped cepstral
coefficients were computed every 10 ms, using a 20 ms data window,
and vector quantized using an 8-bit codebook.

Figure 1: Recognition Performance (% Word Error) as a Function of
Grammar Perplexity. The axes are log scale.

Figure  1  shows  recognition  performance,  averaged  across 
the  eight  speakers,  for  the  three  grammar  conditions.  The 
performance  is  given  in  percent  word  error: 

$$\text{WORD ERROR} = 100 \times \frac{S + D + I}{N}$$

where:

S = number of substitution errors,
D = number of deletion errors,
I = number of insertion errors,
N = total number of word tokens in the test sentences.

This measure has been proposed as a standard within the DARPA
community. Note that since the number of insertion errors possible
is not bounded, this error measure can exceed 100%.


A  word  hypothesis  is  counted  in  error  if  it  does  not  iden¬ 
tically  match  the  correct  word  transcription.  Specifically, 
homophones (e.g., to, two, too; or ships, ship's, ships') are
counted as errors. Homophone errors typically occur only in the
null grammar experiments where they account for approximately 4% of
the word error rate. Furthermore, no special significance is given
to errors which are phonetically close to the correct answer
(minimal pair differences) or to errors which leave the semantic
interpretation of the sentence intact (most deletions of the word
'the').

Individual results for each speaker are shown in Table 1. Two
speakers, CMR and DTD, are female. The results are given as word
error, defined above, and as word correct:

$$\text{WORD CORRECT} = 100 \times \left[ 1 - \frac{S + D}{N} \right]$$

where S, D, and N are defined as before.

Note that:

$$\text{WORD ERROR} \ne 100 - \text{WORD CORRECT}.$$

For the pattern and word-pair grammars, the sentence error rate and
test set perplexity are also given. For the null grammar case, the
sentence error rate is near 90%, and the perplexity = 1000.
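
The two scoring measures can be sketched in a few lines of Python. This is only an illustration of the formulas above; the function names and the error counts in the example are hypothetical and are not taken from the evaluation.

```python
def word_error(subs: int, dels: int, ins: int, n_ref: int) -> float:
    """Percent word error: 100 * (S + D + I) / N."""
    return 100.0 * (subs + dels + ins) / n_ref


def word_correct(subs: int, dels: int, n_ref: int) -> float:
    """Percent word correct: 100 * (1 - (S + D) / N). Insertions are ignored."""
    return 100.0 * (1.0 - (subs + dels) / n_ref)


# Hypothetical counts for a 200-word test set (not data from the paper).
S, D, I, N = 10, 5, 3, 200
err = word_error(S, D, I, N)
cor = word_correct(S, D, N)
print(f"word error   = {err:.1f}%")
print(f"word correct = {cor:.1f}%")
print(f"100 - word correct = {100.0 - cor:.1f}%")

# Insertions are unbounded, so the error measure can pass 100%.
print(f"many insertions: {word_error(0, 0, 250, N):.1f}%")
```

Because insertions enter the error measure but not the correct measure, the two values differ whenever any insertion occurs.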

6  Discussion 

In our experience, average word error (E) for a set of speakers can
be estimated as a function of perplexity (L) by:

$$E = \alpha \sqrt{L} \qquad (4)$$

Figure 1 indicates that $\alpha \approx 1$ for this data set over
most of the perplexity range. We have conducted numerous
experiments on speech collected at BBN in normal office
environments. The experiments have used a variety of grammars
including those reported here. We consistently find the average
word error to be reasonably predicted by using $\alpha = 0.5$,
which is half the error rate obtained for the TI speakers. The
difference in average performance between the TI and BBN data may
be explained by differences in speaking style and rate. The
speakers collected at BBN have some experience with speech
recognition systems and generally speak more clearly than the
speakers collected at TI.
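
As a rough illustration of this rule of thumb, the sketch below evaluates the square-root relation at the three grammar perplexities used in this paper. The function name is hypothetical, and the rule is the empirical approximation described above, not a fitted model.

```python
import math


def predicted_word_error(perplexity: float, alpha: float = 1.0) -> float:
    """Rule-of-thumb average word error (percent): alpha * sqrt(perplexity)."""
    return alpha * math.sqrt(perplexity)


# Perplexities of the three grammars reported in the paper.
grammars = [("sentence pattern", 9), ("word-pair", 62), ("null", 1000)]

for name, perplexity in grammars:
    ti = predicted_word_error(perplexity, alpha=1.0)    # TI speakers
    bbn = predicted_word_error(perplexity, alpha=0.5)   # BBN speakers
    print(f"{name:17s} L={perplexity:5d}  alpha=1: {ti:5.1f}%  alpha=0.5: {bbn:5.1f}%")
```

For the null grammar the rule gives roughly 31.6%, close to the 32.4% average word error listed in Table 1.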

While the average performance is generally predicted by perplexity,
an individual speaker's performance may not be. For example,
speaker DTB performs far below average for the null grammar but
above average for the word-pair and pattern grammars. Similarly,
the performance for RKM on the word-pair grammar is far worse than
would be predicted from his results on the pattern or null grammar.

Table 1: Recognition Performance by Speaker for three grammar conditions.

          Sentence pattern grammar      Word-pair grammar             Null grammar
Speaker   Word   Word    Sent.  Perp.   Word   Word    Sent.  Perp.   Word   Word
          error  correct error          error  correct error          error  correct
DTD        1.0    99.0     8      8      6.7    94.2    44     54     26.7   75.3
JWS        0.9    99.1     8      9      4.3    96.2    28     59     25.6   75.4
PGH        0.5    99.5     4      9      6.0    96.0    24     56     32.0   70.5
RKM        2.4    98.1    16     10     16.4    89.7    52     64     30.5   71.8
TAB        0.5   100.0     4      9      3.2    97.7    20     67     24.8   76.5
avg        ...    99.1    10.5    9      ...    94.8    37.0   62     32.4   70.1

[The rows for the remaining three speakers and the two average word
error entries shown as "..." are illegible in the source. Surviving
fragments of the missing rows: 8, 62, 20, 7, 9.3, 00, 65.4, 0.5, 4,
10, 5.4, 96.5, 34, 39.4, 63.1.]

It is clear from these results that performance can be
made arbitrarily high by lowering the grammar perplexity.
For  large  vocabulary,  complex  task  domain  applications, 
however,  low  perplexity  grammars  are  likely  to  be  too  re¬ 
strictive  for  real  use.  We  expect  that  habitable  grammars 
for 1000 word task domain applications will require perplexities
larger than 50.

Measuring  Perplexity  of  Language  Models
Used  in  Speech  Recognizers


Salim  Roucos 
BBN  Laboratories 
Cambridge,  MA  02238 


In  this  note,  we  define  one  measure  of  perplexity  of  a  language  model  and  present  a  method  for 
computing  it.  We  hope  that  by  agreeing  on  a  common  method  of  measuring  perplexity,  it  will 
become  easier  to  compare  speech  recognition  results  when  different  language  models  are  used. 

1.  Test-set  perplexity 

We  describe  a  measure  for  characterizing  the  complexity  of  a  language  model;  we  call  the  measure 
test-set  perplexity.  This  measure  of  perplexity  is  defined  for  any  specific  set  of  sentences  and  a 
given  language  model.  In  general,  the  word  accuracy  of  a  speech  recognizer  using  a  given  language 
model  is  expected  to  decrease  as  the  test-set  perplexity  of  a  set  of  test  sentences  increases.  Knowing 
both  the  recognition  performance  and  test-set  perplexity  will  help  in  comparing  recognition 
algorithms  that  use  different  language  models. 

A language model is defined by the set of probabilities
$Q(w_1 \ldots w_n)$ for all word sequences $w_1 \ldots w_n$. Given a
language model $Q(\cdot)$, the test-set perplexity of a set of
sentences is defined as

$$L = 2^K \qquad (1)$$

where K, the average per word log probability (called logprob), is
given by

$$K = -\frac{1}{n} \log_2 Q(w_1 w_2 \ldots w_n) \qquad (2)$$

where $w_1 \ldots w_n$ represents the sequence of words in all the
sentences of the test set, and $Q(w_1 \ldots w_n)$ is the language
model probability of the word sequence. The word sequence
$w_1 \ldots w_n$ is obtained from a test set by concatenating all
test sentences separated by sentence boundary markers. We note that
the real probability of the word sequence $w_1 \ldots w_n$ is
denoted by $P(w_1 \ldots w_n)$ and that we will discuss later in
this note the relationship of P and Q. For the special case of a
language model which assumes all the words from a vocabulary of
size V are equally likely to occur at any point, i.e.,
$Q(w_1 \ldots w_n) = V^{-n}$, the average logprob is $\log_2 V$ and
the test-set perplexity equals the vocabulary size V for any test
set from any source $P(\cdot)$; hence, the interpretation that
test-set perplexity corresponds to the "average branching" of the
language model along the test set.


2.  Computing  test-set  perplexity 

Equation 2 can be rewritten as

$$K = -\frac{1}{n} \sum_{i=1}^{n} \log_2 Q(w_i \mid w_1^{i-1})$$

where $w_1^j$ denotes the word sequence $w_1 \ldots w_j$. In this
case, we have factored the joint probability as the product of the
conditional probabilities $Q(w_i \mid w_1^{i-1})$, which represent
the probability that word $w_i$ will appear next given all the text
up to word $w_{i-1}$. We include a special symbol to delimit
sentence boundaries and this symbol counts as one word. All other
delimiters (such as commas, etc.) are dropped. So, if a sentence
consists of m English words, it accounts for m+1 symbols in the
above logprob computation. The vocabulary includes the sentence
boundary marker as one item.

The factored form is convenient for the usual finite-state N-gram
models and for the deterministic grammars that have been used in
speech recognizers. For the case of deterministic finite state
grammars, where a word sequence is either accepted or rejected as a
legal sentence of the language, it is important to make an
assumption about the value of the conditional probability
$Q(w_i \mid w_1^{i-1})$. In the absence of other information, we
assume that all the legal words that can follow a partial word
string are equally likely. To determine the number of distinct
legal words that can follow a partial string of words, we need a
network representation in the form of a deterministic finite state
machine, which means that all arcs from a node represent a possible
word to follow (no null arcs) and that no two arcs leaving a node
have the same word associated with them. There is a standard
algorithm for converting a non-deterministic representation into a
deterministic representation (see Aho & Ullman).

For more general formal languages such as context-free grammars,
augmented transition network grammars, etc., one needs to determine
all partial parses up to word $w_{i-1}$ and count how many distinct
words can follow after word $w_{i-1}$. Then, using an assumption
that all choices are equally likely, the test-set perplexity is the
geometric mean of the number of word choices possible along the
test set.

The factoring of the joint probability can be done in at least two
directions: forward as $Q(w_i \mid w_1^{i-1})$, or backward as
$Q(w_i \mid w_{i+1}^{n})$. With the deterministic finite state
model and the uniform assumption of equally likely words out of a
node (forward and backward), the forward and backward perplexities
for the same test set of legal sentences are not the same because
we have two different statistical models. Typically, we compute the
perplexity of the forward (left-to-right) language model.

The  same  test  set  should  be  used  to  compute  the  perplexity  and  to  measure  the  recognition 
performance.  Note  that  perplexity  depends  not  only  on  the  language  model  but  also  on  the 
particular  test  set  (in  the  limit  of  large  test  sets,  the  variance  of  the  test-set  perplexity  approaches 
zero). 

3.  Relation  to  Entropy 

For a given language model, the test-set perplexity is a random
variable that depends on the actual test set. For a test set
$w_1 \ldots w_n$ obtained from a well behaved source (ergodic) with
probability $P(w_1 \ldots w_n)$, the time average of the logprob
converges to its expected value with large n:

$$\lim_{n \to \infty} -\frac{1}{n} \sum_{w_1 \ldots w_n} P(w_1 \ldots w_n) \log_2 Q(w_1 \ldots w_n) = \lim_{n \to \infty} -\frac{1}{n} \log_2 Q(w_1 \ldots w_n)$$

where the summation is over all sequences $w_1 \ldots w_n$. Since
$P(\cdot)$ is unknown, the right hand side is particularly useful
because a large test set is sufficient to compute an estimate of
the expected value. Note if Q = P, which is true when we know the
correct language model, then for large n the test-set perplexity
approaches the source language perplexity given by $2^H$ where H is
the entropy of the language. When $Q \ne P$, the expected test-set
perplexity will be larger than the language perplexity $2^H$ for
any n. The goal in building language models is to minimize the
difference between the expected test-set perplexity and the
language perplexity. Note that for small n, K may sometimes be
less than H.
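
The gap between expected test-set perplexity and language perplexity can be illustrated with a toy example. The sketch below assumes a memoryless source over three words, so that the expected per-word logprob reduces to the cross-entropy between P and Q. The distributions and function names are hypothetical and chosen only for illustration.

```python
import math


def entropy(p: dict) -> float:
    """Entropy H of the source distribution p, in bits per word."""
    return -sum(pw * math.log2(pw) for pw in p.values() if pw > 0)


def expected_logprob(p: dict, q: dict) -> float:
    """Expected per-word logprob K when words drawn from p are scored by model q."""
    return -sum(pw * math.log2(q[w]) for w, pw in p.items() if pw > 0)


# Hypothetical memoryless source P and a uniform language model Q.
P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

H = entropy(P)
K = expected_logprob(P, Q)

print(f"language perplexity 2^H       = {2 ** H:.3f}")
print(f"expected test-set perplexity  = {2 ** K:.3f}")
print(f"with the correct model Q = P  = {2 ** expected_logprob(P, P):.3f}")
```

With the uniform model the expected perplexity equals the vocabulary size, 3, while the language perplexity of this source is about 2.83. The two agree only when Q = P.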
Appendix

Given  a  language  that  allows  the  following  9  sentences: 

aa. 

ab. 

ac. 

aba. 

abb. 

abc. 

aca. 

acb. 

acc. 

One  can  use  the  following  non-deterministic  finite  state  automaton  to  efficiently  represent  the 
language: 


To  compute  the  test-set  perplexity  for  the  two  sentences  aa.  and  abc.  we  use  the  equivalent 
deterministic  finite  state  automaton  for  the  language: 


The test-set perplexity for the two sentences "aa." and "abc." is
given by:

1- concatenate as aa.abc. a seven word long test set.

2- the logprob is $K = \frac{1}{7} \log_2(1 \times 3 \times 1 \times 1 \times 3 \times 4 \times 1) = 0.7386$,
so the perplexity is $L = 2^K \approx 1.668$

Therefore, on average the branching is 1.67 words.


Presented  at  the  IEEE  International  Conference  on  Acoustics,  Speech,
and  Signal  Processing,  May  23-26,  1989,  Glasgow,  Scotland.

Abstract

In this paper we present several techniques to improve the
algorithm presented last year for speaker-adaptive training in
continuous speech recognition. The previous method uses a
transformation matrix to modify the hidden Markov Model (HMM)
parameters of a pre-chosen prototype speaker to model a target
speaker. To estimate the transformation matrix, it aligns a set of
target speech with the same set of speech uttered by the prototype
speaker using dynamic time warping. We focus on improving the
previous method in the modeling of the spectral differences between
two speakers, and the accuracy of the alignment. To improve the
modeling of the spectral differences, we implemented a
phoneme-dependent mapping procedure which transforms the prototype
HMMs to the estimated target HMMs using a set of phoneme-dependent
matrices. To improve the alignment, we developed a modeling of the
silence, a linear duration normalization, and an iterative
normalization procedure. We tested the new methods in the standard
DARPA database with a grammar of perplexity 60. The performance
shows a 30% word error reduction compared with that of the previous
algorithm.

1.  Introduction 

Hidden Markov Modeling techniques have enjoyed great success in
large-vocabulary continuous speech recognition using
speaker-dependent or speaker-independent training. To achieve
state-of-the-art performance in large vocabulary tasks, it has been
necessary to collect a large amount of speech (~30 min) from each
target speaker for speaker-dependent training, or from a large
number of speakers (>100) for speaker-independent training. It is
not feasible to go through such a long and tedious recording
process in some applications. The speaker-adaptive training
paradigm we have been advocating is designed to alleviate this
difficulty. Our current method requires collecting 30 minutes of
speech from only one prototype speaker, and a small amount of
speech (typically two minutes) from each target speaker. Our
long-term goal for speaker-adaptive training is to achieve
recognition accuracy with two minutes of adaptation speech
equivalent to that of 30-minute speaker-dependent training.

At ICASSP'87 [1], we presented our basic procedure for
speaker-adaptive training, which uses a speaker transformation on
the phonetic hidden Markov models of a prototype speaker. Starting
with the trained HMM parameters of a prototype speaker and two
minutes of speech from a new target speaker, we estimate a maximum
likelihood probability transformation matrix by aligning the target
speech against the prototype models using the forward-backward
algorithm. The probability transformation matrix defining a
probabilistic mapping between the prototype speaker and the target
speaker is used to transform the prototype HMM models to the
estimated HMM models for the target speaker.

At ICASSP'88 [2], we proposed an improved algorithm which aligns
the target speech against the same sentences uttered by the
prototype speaker using a dynamic time warping (DTW) algorithm to
compute the co-occurring spectral pairs. We then estimate the
transformation matrix by counting the spectral co-occurrences for
the two speakers. This algorithm worked much better than the
previous algorithm, but the performance was significantly worse for
some speakers than for others. There are two possible reasons for
the degraded recognition performance: (1) a single transformation
matrix is not enough to model the spectral differences between the
target speaker and the prototype speaker, and (2) the DTW algorithm
had not found phonetically "correct" alignments between the target
speech and the prototype speech, thus leading to an inferior
estimate of the transformation matrices.

To improve the modeling of the spectral differences between two
speakers, we implemented a phoneme-dependent transformation
procedure, which uses a set of transformation matrices to transform
the prototype HMM parameters to model the target speaker. Each
transformation matrix represents a different probabilistic spectral
mapping for each phoneme between two speakers.

We believe that the inferior alignments result from several facts.
First, frequently one speaker inserts a long pause between two
words in a sentence when the other speaker has not. Second, the
alignment accuracy degrades when the duration of the prototype
sentence is very different from that of the aligned target
sentence. Third, the spectral spaces of the two speakers may be
very different. To improve the alignment, we propose a modeling of
the silence, a linear duration normalization before computing the
alignment, and an iterative normalization procedure to compute the
alignment.

Paper  Outline 

In Section 2, we briefly introduce our basic speaker-adaptive
training approach using phoneme-dependent mappings. In Section 3,
we propose several techniques to improve the alignment: a modeling
of the silence in the target speech and a linear duration
normalization before computing the alignment, and an iterative
normalization algorithm to compute the alignment and estimate the
mappings. We show that this algorithm converges to a local minimum
of the mean-squared error for the alignment. To evaluate the
proposed algorithm, we then present in Section 4 speaker-adaptive
recognition results using two minutes of target speech on the
standard DARPA database, compared to the performance of 28-minute
speaker-dependent training.

2.  Phoneme-Dependent  Mappings


In this section we describe our speaker adaptation procedure using
phoneme-dependent mappings. For each state of a discrete HMM, we
have a discrete probability density function (pdf) defined over a
fixed set of N spectral templates. We can view the discrete pdf for
each state s as a probability row vector
$p(s) = [p(k_1|s), p(k_2|s), \ldots, p(k_N|s)]$, where $p(k_i|s)$
is the probability of spectral template $k_i$ at state s of the HMM
model.

If we denote a quantized spectrum from the prototype speaker as
$k_i$, $1 \le i \le N$, and from the target speaker as $k'_j$,
$1 \le j \le N$, where i and j are indices denoting the quantized
spectra, then we can denote the probability that the target speaker
will produce quantized target spectrum $k'_j$, given the prototype
speaker produced spectrum $k_i$, as $p(k'_j|k_i)$ for all i, j. The
probabilistic mapping can be defined as follows:

$$p(k'_j|s) = \sum_{i=1}^{N} p(k_i|s)\, p(k'_j|k_i), \quad 1 \le j \le N \qquad (1)$$

The probabilities $p(k'_j|k_i)$ for all i and j form an NxN matrix,
T, which can be interpreted as a probabilistic transformation
matrix from one speaker's spectral space to another's at each
state. We can then compute the discrete pdf $p'(s)$ at state s for
the target speaker as the product of the row vector $p(s)$ and the
matrix T:

$$p'(s) = p(s) \times T, \quad \text{where } T_{ij} = p(k'_j|k_i) \qquad (2)$$
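
A minimal sketch of the mapping in equation (2), written in plain Python: the prototype pdf is a row vector and each row of T is a conditional distribution over target spectra. The three-template codebook and all probability values are hypothetical.

```python
def transform_pdf(p, T):
    """Target pdf p'(s) = p(s) x T for a row vector p and row-stochastic matrix T."""
    n_proto = len(p)
    n_target = len(T[0])
    return [sum(p[i] * T[i][j] for i in range(n_proto)) for j in range(n_target)]


# Hypothetical prototype pdf at one HMM state over N = 3 spectral templates.
p_state = [0.7, 0.2, 0.1]

# Hypothetical transformation matrix: T[i][j] = p(k'_j | k_i); each row sums to 1.
T = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
]

p_target = transform_pdf(p_state, T)
print([round(x, 3) for x in p_target])
print("sum =", round(sum(p_target), 6))
```

Because every row of T sums to one, the transformed vector is again a probability distribution.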


In equations (1) and (2), we assume that the probability for
spectrum k' given k is independent of s, which indicates that a
single probabilistic spectral mapping will transform the speech of
one speaker to another. However in practice, some of the
differences between speakers cannot be modeled by a single
transformation. To have a more detailed modeling of the spectral
differences between two speakers, we define phoneme-dependent
mappings:

$$p(k'_j|s) = \sum_{i=1}^{N} p(k_i|s)\, p(k'_j|k_i, \phi(s)) \qquad (3)$$

where $\phi(s)$ specifies an equivalence class of states in models
that represent the same phoneme as s.

Since only two minutes of target speech is available, it is not
adequate to estimate reliable probabilistic mappings for all
phonemes. Therefore, in transforming the HMM models, we interpolate
the phoneme-dependent mappings with the phoneme-independent
mappings to improve the robustness of the adapted HMM models. The
weight for the combination is different for each prototype spectral
index (each row of the transformation matrices), and it is
dependent on the number of occurrences of the observed prototype
spectrum for that phoneme in the adaptation speech. The important
step in the phoneme-dependent mapping procedure is to estimate the
$p(k'_j|k_i, \phi(s))$ that optimizes the recognition performance.
In the next section, we will describe an iterative algorithm to
estimate the phoneme-dependent mappings.
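
The row-wise interpolation can be sketched as follows. The text does not give the weighting formula, so the weight count / (count + tau) and the constant tau used here are assumptions made for illustration, as are the function name and the example matrices.

```python
def interpolate_rows(T_dep, T_indep, counts, tau=5.0):
    """Blend phoneme-dependent and phoneme-independent mappings row by row.

    counts[i] is the number of times prototype spectrum i was observed for
    this phoneme in the adaptation speech. The weight count / (count + tau)
    is an assumed form; the paper does not specify the formula.
    """
    blended = []
    for row_dep, row_indep, count in zip(T_dep, T_indep, counts):
        weight = count / (count + tau)
        blended.append(
            [weight * d + (1.0 - weight) * g for d, g in zip(row_dep, row_indep)]
        )
    return blended


# Hypothetical 2x2 mappings for one phoneme.
T_dep = [[1.0, 0.0], [0.5, 0.5]]      # phoneme-dependent estimate
T_indep = [[0.6, 0.4], [0.2, 0.8]]    # phoneme-independent estimate
counts = [0, 15]                      # row 0 never observed, row 1 seen 15 times

T_mix = interpolate_rows(T_dep, T_indep, counts)
for row in T_mix:
    print([round(x, 3) for x in row])
```

A row that was never observed for the phoneme falls back entirely to the phoneme-independent mapping, while frequently observed rows lean toward the phoneme-dependent estimate.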


3.  Improving  the  Alignment 


3.1  Before  Computing  the  Alignment 
Insertion  of  Silence  into  Prototype  Speech 

When two different speakers read the same sentence, they may insert
pauses (silence) in different places of the sentence. In this case,
the DTW is forced to align silence frames to speech frames
resulting in phonetically incorrect alignments. To achieve correct
alignments for target speech with arbitrary inter-word pauses, we
insert a synthesized silence spectrum between each word of the
prototype speech. We compute the synthesized silence spectrum for
each utterance as the average of the spectral parameters over
several frames of silence from the target speaker.
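
The silence insertion step might look like the sketch below. It assumes that the prototype utterance is already segmented into per-word lists of frames and that a single synthesized frame is inserted at each word boundary; neither detail is stated in the text, and all names and values are hypothetical.

```python
def mean_frame(frames):
    """Average a list of equal-length spectral frames component by component."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def insert_silence(word_segments, silence_frame):
    """Concatenate per-word frame lists, inserting one silence frame between words."""
    padded = []
    for k, segment in enumerate(word_segments):
        if k > 0:
            padded.append(list(silence_frame))
        padded.extend(segment)
    return padded


# Hypothetical 2-dimensional frames.
target_silence_frames = [[0.1, 0.2], [0.3, 0.4]]        # silence from the target speaker
prototype_words = [[[1.0, 1.0], [1.1, 0.9]], [[2.0, 2.0]]]  # two words of prototype speech

silence = mean_frame(target_silence_frames)
padded = insert_silence(prototype_words, silence)
print(len(padded), "frames after insertion")
```

Using the target speaker's own silence frames keeps the inserted frame close to any pause the target speaker actually produced, so DTW can absorb the pause there.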

Linear  Warping  the  Target  Spectra 

The warping function produced by DTW can be viewed as a mapping
from the time axis of one pattern onto that of another. To align
two sets of spectra with the same text, it is reasonable to use DTW
with slope constraints [S], so as to avoid unrealistic mappings,
for example that one target frame gets aligned to many prototype
frames. However if the durations of the two aligned sentences are
very different, the slope constraint may interfere with their
alignment accuracy. Since our prototype speaker speaks slowly and
carefully, large speaking rate differences can exist in words or
word sequences which are spoken rapidly and casually by the target
speakers. To overcome this problem, we linearly warp the spectral
sequence of each target sentence to be the same length as the
corresponding prototype spectral sequence. We perform the linear
warping on the target spectral sequence by copying the target
spectral frame that is closest to a linear time scaling. It avoids
losing detailed information in the original target spectral
sequence due to interpolation.
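
A minimal sketch of the linear warping step, assuming that "closest to a linear time scaling" means rounding the linearly scaled time index to the nearest target frame. The function name and the toy sequences are hypothetical.

```python
def linear_warp(target_frames, proto_len):
    """Resample target_frames to proto_len frames by copying the nearest frame.

    Frames are copied, never interpolated, so every output frame is an
    unmodified frame of the original target sequence.
    """
    n = len(target_frames)
    if proto_len == 1:
        return [target_frames[0]]
    warped = []
    for t in range(proto_len):
        # Position of output frame t on the target time axis, rounded to a frame index.
        src = round(t * (n - 1) / (proto_len - 1))
        warped.append(target_frames[src])
    return warped


# Shortening: a 5-frame target sentence warped to a 3-frame prototype length.
print(linear_warp([0, 1, 2, 3, 4], 3))

# Lengthening: a fast 2-frame target warped to a 4-frame prototype length.
print(linear_warp(["a", "b"], 4))
```

After this step the two sequences have equal length, so a slope-constrained DTW starts from a diagonal that already accounts for the overall speaking rate difference.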

3.2  Computing  the  Alignment  Iteratively 

When the target spectral space is very different from the prototype
spectral space, the minimum error alignment produced by the DTW
algorithm may not correspond to the phonetically correct alignment.
To improve the accuracy of the alignment, we developed an iterative
alignment and normalization procedure. In each iteration of the
algorithm, we align the target speech to the prototype speech, then
using the alignment we shift each target spectral frame in the
target speech by an amount that is dependent on the index of the
corresponding vector-quantized value of the prototype frame to
which it aligns. Using the shifted values of the target frames, we
realign the target speech to the prototype speech, etc. We will
prove in Section 3.3 that this algorithm converges to a local
minimum in the mean-squared error (mse) for the alignment. The mse
of a given alignment is equal to the average of the
squared-difference of the target spectral frames and the
corresponding prototype spectral frames in the alignment path. The
formal definition of the mse given an alignment will be defined
later in equation (4). The detailed algorithm is described as
follows:

1.  Classify (quantize) each frame of the prototype
spectra into one of $2^M$ VQ regions using a
well-trained M-bit prototype VQ codebook.


We present several techniques to improve the alignment
accuracy. In Section 3.1 we introduce two processing steps applied before
performing the alignment. Then in Section 3.2 we describe a
new iterative algorithm to compute the alignment. We deal
with the convergence of the iterative algorithm in Section 3.3.


2.  Align each target spectrum {x} in the adaptation
sentences to a prototype spectrum {y} using a
slope-constrained DTW. If the mse of the
alignment is similar to the mse of the alignment
resulting from the previous iteration, then the
algorithm has converged and we stop.


3.  Classify  each  frame  of  the  target  spectra  into  one 
of  the  prototype  VQ  regions  according  to  the 
classification  of  its  aligned  prototype  frame. 

4.  Compute the mean of {x} and {y} in each of the
prototype VQ regions.

5.  Shift each frame of the target spectra by the
difference vector between the mean of {x} and
{y} for that VQ region.

6.  Replace the target spectra with the shifted target
spectra, and go to step 2.
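
Steps 1 to 6 can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the slope constraint is simplified so that the prototype index advances by 0, 1, or 2 frames per target frame (every target frame is paired with exactly one prototype frame), the codebook is passed in as a plain list of centroids, and the names `iterative_normalization`, `max_iter`, and `tol` are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two frames."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def quantize(frame, codebook):
    """Index of the nearest codeword."""
    return min(range(len(codebook)), key=lambda l: sqdist(frame, codebook[l]))


def dtw(x, y):
    """Pair each target frame x[i] with one prototype frame y[j(i)].

    Simplified slope constraint (an assumption of this sketch): j advances
    by 0, 1, or 2 per target frame. Returns (list of j per i, mse).
    """
    I, J = len(x), len(y)
    INF = float("inf")
    cost = [[INF] * J for _ in range(I)]
    back = [[0] * J for _ in range(I)]
    cost[0][0] = sqdist(x[0], y[0])
    for i in range(1, I):
        for j in range(J):
            best, arg = INF, j
            for pj in (j, j - 1, j - 2):
                if pj >= 0 and cost[i - 1][pj] < best:
                    best, arg = cost[i - 1][pj], pj
            if best < INF:
                cost[i][j] = best + sqdist(x[i], y[j])
                back[i][j] = arg
    if cost[I - 1][J - 1] == INF:
        raise ValueError("no path satisfies the slope constraint")
    path = [J - 1]
    for i in range(I - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, cost[I - 1][J - 1] / I


def iterative_normalization(target, prototype, codebook, max_iter=10, tol=1e-9):
    labels = [quantize(f, codebook) for f in prototype]        # step 1
    x = [list(f) for f in target]
    history, path = [], None
    for _ in range(max_iter):
        path, mse = dtw(x, prototype)                           # step 2
        converged = bool(history) and history[-1] - mse <= tol
        history.append(mse)
        if converged:
            break
        regions = {}
        for i, j in enumerate(path):                            # step 3
            regions.setdefault(labels[j], []).append((i, j))
        for pairs in regions.values():
            # steps 4-5: mean(x) - mean(y) over the pairs in this VQ region
            shift = [sum(x[i][d] - prototype[j][d] for i, j in pairs) / len(pairs)
                     for d in range(len(x[0]))]
            for i, _ in pairs:                                  # step 6
                x[i] = [v - s for v, s in zip(x[i], shift)]
    return x, path, history
```

With this step pattern the number of aligned pairs equals the number of target frames in every iteration, so the values recorded in `history` cannot increase from one iteration to the next.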

Estimation  of  Phoneme-Dependent  Mappings 

From the alignment of the last iteration, we count the
co-occurrences of the quantized spectral indices for each of the
frames in the adaptation sentences, and form the co-occurrence
matrix $N(\phi(s))$ for each phoneme, where each element $N_{ij}$ is
the number of co-occurrences of the target spectrum $k'_j$ and
the prototype spectrum $k_i$. We then normalize the rows of
$N(\phi(s))$ to form the phoneme-dependent transformation
matrices $T(\phi(s))$.
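
The counting and row normalization can be sketched as below. The input format (a list of `(phoneme, prototype_index, target_index)` triples taken from the final alignment) and the function name `estimate_mappings` are assumptions for illustration; the text does not say how unobserved rows are handled, so they are left as zeros here.

```python
def estimate_mappings(aligned, codebook_size):
    """Build one row-normalized co-occurrence matrix per phoneme.

    aligned: iterable of (phoneme, proto_idx, target_idx) triples
        (hypothetical format for the aligned, quantized frame pairs).
    Returns {phoneme: T} where T[i][j] = N[i][j] / sum_j N[i][j].
    """
    counts = {}
    for phoneme, i, j in aligned:
        matrix = counts.setdefault(
            phoneme, [[0] * codebook_size for _ in range(codebook_size)])
        matrix[i][j] += 1
    mappings = {}
    for phoneme, matrix in counts.items():
        rows = []
        for row in matrix:
            total = sum(row)
            # Rows never observed stay all-zero; any smoothing is outside this sketch.
            rows.append([c / total for c in row] if total else [0.0] * codebook_size)
        mappings[phoneme] = rows
    return mappings
```

Each non-empty row of a resulting matrix sums to one, so it can be read as an estimate of how often prototype codeword $k_i$ co-occurs with each target codeword $k'_j$ within that phoneme.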

Below  we  prove  that  the  iterative  algorithm  is  guaranteed 
to  converge  to  a  local  minimum  in  the  mse  of  the  alignment. 

3.3  Convergence of the Iterative Algorithm

DTW Minimizes the Mse

Suppose $\{x_i, 1 \le i \le I\}$ is an I-frame target spectral
sequence for a sentence and $\{y_j, 1 \le j \le J\}$ is a J-frame
prototype spectral sequence for the same sentence. As a
measurement of the difference between two feature vectors $x_i$
and $y_j$, a Euclidean distance $d(c) = d(i,j) = \|x_i - y_j\|$ is employed
between them. Using a dynamic time warping (DTW)
algorithm [5], we obtain a warping function
$F = c(1), c(2), \ldots, c(k), \ldots, c(K)$, where $c(k) = (i(k), j(k))$, by minimizing
the accumulated mse between {x} and {y}, which is

\[
E(F) = \frac{1}{K} \sum_{k=1}^{K} [d(c(k))]^2 = \frac{1}{K} \sum_{k=1}^{K} \left\| x_{i(k)} - y_{j(k)} \right\|^2 \tag{4}
\]

The warping function, which is also called the alignment,
realizes a mapping from the time axis of $\{x_i, 1 \le i \le I\}$ onto that
of $\{y_j, 1 \le j \le J\}$. The alignment indicates the matched
prototype spectral frame for each target spectral frame.

Shifting Target Spectrum Reduces mse Further

Given the alignment (the paired target spectral frames and
prototype spectral frames), we can reduce the mse of the
alignment further by making the target spectral space closer to
the prototype spectral space. Suppose we shift each target
spectral frame by a single vector $x_0$. The mse of the alignment
becomes

\[
E(F) = \frac{1}{K} \sum_{k=1}^{K} \left\| x_{i(k)} - x_0 - y_{j(k)} \right\|^2 \tag{5}
\]

By taking the partial derivative of $E(F)$ with respect to $x_0$, we
have

\[
\frac{\partial E(F)}{\partial x_0} = 2 x_0 - \frac{2}{K} \sum_{k=1}^{K} x_{i(k)} + \frac{2}{K} \sum_{k=1}^{K} y_{j(k)} \tag{6}
\]

By setting equation (6) equal to zero, we obtain

\[
x_0 = \frac{1}{K} \sum_{k=1}^{K} x_{i(k)} - \frac{1}{K} \sum_{k=1}^{K} y_{j(k)} \tag{7}
\]

Therefore shifting the target spectral frames by the difference
of the means of {x} and {y} minimizes the total mse of the
given alignment.

It is quite possible that the mean difference between two
spectral spaces varies across different VQ partitions. To model
the two spectral spaces in more detail, we shift the
target spectrum using a set of VQ-dependent vectors
$\{x(1), x(2), \ldots, x(2^M)\}$. The total mse can be represented by the
following equation

\[
E(F) = \sum_{l=1}^{2^M} \frac{1}{K} \sum_{k=1}^{K(l)} \left\| x^{(l)}_{i(k)} - x(l) - y^{(l)}_{j(k)} \right\|^2 \tag{8}
\]

where $K(l)$ is the number of aligned frame pairs that fall in VQ
region $l$, and the superscript $(l)$ marks those pairs. By following
the same approach as described above, we obtain

\[
x(l) = \frac{1}{K(l)} \sum_{k=1}^{K(l)} x^{(l)}_{i(k)} - \frac{1}{K(l)} \sum_{k=1}^{K(l)} y^{(l)}_{j(k)}, \qquad 1 \le l \le 2^M \tag{9}
\]

which minimize both the total mse and the mse for each VQ
region.

Convergence of Mse

In the first iteration, step two of the algorithm produces
the first alignment between the prototype speech and the target
speech by minimizing the mse. Shifting the target spectra at
step five by the mean difference reduces the mse in each VQ
region independently and also gives the minimum mse for that
alignment.

After the first iteration, we realign the shifted target
spectra with the original prototype spectra, and the resulting
alignment may or may not differ from the previous
alignment. If the alignment is different, then the current mse
must be smaller than that of the previous iteration because
DTW provides a new alignment with the smallest mse. If the
alignment is the same as the previous one, then the mse is
unchanged and the algorithm has converged. In the next section,
we present some experimental results obtained using the iterative
algorithm with three iterations.

4.  Experimental results


4.1  Experimental  Conditions 

In the experiments shown below, we use the well-trained
HMMs of a single speaker, RS, as the prototype. RS is a careful
male speaker with a New York dialect. RS recorded 600
sentences at BBN in a normal office environment. The 600
utterances constituted about 30 minutes of speech, which was
used to estimate the HMM parameters for the prototype
models.

A 1000-word database of continuously read speech has
been designed and recorded within the DARPA Strategic
Computing Speech Recognition Program [3]. This data
consists of read sentences which are appropriate to a naval
resource management task. It was recorded in a sound-isolated
recording booth at Texas Instruments (TI). We use eight TI
speakers from this database to test the different adaptation
procedures and compare them with the performance of 28-minute
speaker-dependent training.


In the adaptation experiments of this section, we use
adaptation material of two minutes duration. We lowpass
filtered both the target speech and the prototype speech at 10
kHz and sampled at 20 kHz. Every sentence in the target
speech and the prototype speech is represented by frames of
mel-frequency cepstral coefficients (MFCCs).

All the recognition experiments used a word-pair
grammar of perplexity 60. This grammar allows all two-word
sequences which occur in the task domain definition [4]. The
recognized sequence of words was compared automatically to
the correct answer to determine the percentage of errors of each
type: substitutions, deletions, and insertions. We use an error
measure that reflects all three types of errors in a single
number. The percent error is given by

\[
\%\ \text{error} = 100 \times \frac{\text{substitutions} + \text{insertions} + \text{deletions}}{\text{total input words}}
\]
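
The error measure can be illustrated with the short sketch below. It assumes that the automatic comparison is a minimum-edit-distance word alignment, which the text does not specify, and the function names `error_counts` and `percent_error` are hypothetical.

```python
def error_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) from a minimum-edit alignment."""
    R, H = len(reference), len(hypothesis)
    # Each cell holds (total_errors, subs, dels, ins) for reference[:r] vs hypothesis[:h].
    table = [[None] * (H + 1) for _ in range(R + 1)]
    table[0][0] = (0, 0, 0, 0)
    for r in range(1, R + 1):
        table[r][0] = (r, 0, r, 0)
    for h in range(1, H + 1):
        table[0][h] = (h, 0, 0, h)
    for r in range(1, R + 1):
        for h in range(1, H + 1):
            t, s, d, i = table[r - 1][h - 1]
            if reference[r - 1] == hypothesis[h - 1]:
                diag = (t, s, d, i)
            else:
                diag = (t + 1, s + 1, d, i)
            t, s, d, i = table[r - 1][h]
            deletion = (t + 1, s, d + 1, i)
            t, s, d, i = table[r][h - 1]
            insertion = (t + 1, s, d, i + 1)
            table[r][h] = min(diag, deletion, insertion)
    return table[R][H][1:]


def percent_error(reference, hypothesis):
    """100 x (substitutions + insertions + deletions) / total input words."""
    subs, dels, ins = error_counts(reference, hypothesis)
    return 100.0 * (subs + ins + dels) / len(reference)
```

Because insertions are counted against the number of input (reference) words, this measure can exceed 100% when the recognizer outputs many extra words.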


4.2  Results 


Experiments                            % Word Error

Baseline (ICASSP 88)                       13.8
Phoneme-Dependent Mappings                 12.1
Iterative Normalization Algorithm           9.6
28-Min Speaker-Dependent Training           7.1

Table 1.  Percent word recognition error for
2-min speaker-adaptive training vs. 28-min
speaker-dependent training on 8 speakers


Table 1 contains the recognition word error rates using
different adaptation procedures with two minutes of target
speech, in comparison with the 28-minute speaker-dependent
performance on eight TI speakers. Below we discuss the
performance of each algorithm, row by row, in Table 1.

In the first row, we show the performance of our
previous algorithm [2], which uses a single transformation
matrix (phoneme-independent transformation) to transform the
prototype HMM models to the target HMM models. The word
error rate is 13.8%, which is about two times that of the
28-minute speaker-dependent performance.

By using a set of phoneme-dependent transformation
matrices (instead of one matrix) to perform the transformation,
we reduce the word error rate to 12.1% (shown in the second
row of Table 1). Compared to the performance of the previous
algorithm, this is a 12% error reduction.

By applying the silence modeling, the linear duration
normalization, and the iterative normalization algorithm
described in Section 3 to improve the alignment and estimate
the phoneme-dependent mappings, we reduce the word error
significantly, from 12.1% to 9.6%, which is a 20
percent reduction in the word error rate.
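
As a quick arithmetic check on the relative reductions quoted for Table 1, the following sketch recomputes them from the word error rates in the table. The helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(old, new):
    """Percent reduction in error rate going from `old` to `new`."""
    return 100.0 * (old - new) / old


baseline, phoneme_dependent, iterative = 13.8, 12.1, 9.6

step_1 = relative_reduction(baseline, phoneme_dependent)   # about 12.3
step_2 = relative_reduction(phoneme_dependent, iterative)  # about 20.7
overall = relative_reduction(baseline, iterative)          # about 30.4
```

Note that the two step-wise reductions do not add up to the overall figure, because each is measured relative to a different starting error rate.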

The speaker-dependent performance with 28 minutes of
training speech shown in Table 1 is 7.1% word error using
steady-state MFCCs only. The performance has recently been
improved to 3.6% word error by using additional
feature parameters, including differential MFCCs and power
parameters. We believe that further improvement in
speaker-adaptive training can be achieved by using more features.


5.  Conclusions

In this paper we presented several techniques to improve
the algorithm presented last year for speaker-adaptive training.
The previous method uses a transformation matrix to modify
the hidden Markov model (HMM) parameters of a pre-chosen
prototype speaker to model a target speaker. It estimates the
transformation matrix by aligning a set of target speech with
the same set of speech uttered by the prototype speaker using
DTW. We focus on improving the previous algorithm by: (1)
modeling the spectral differences between two speakers using a
set of phoneme-dependent transformation matrices, and (2)
improving the alignment accuracy by using silence modeling,
a linear duration normalization, and an iterative
alignment procedure. To evaluate the effectiveness of the new
methods, we performed experiments on the standard DARPA
database with a grammar of perplexity 60. The recognition
performance of the new algorithm shows a 30% word error
reduction compared with that of the previous algorithm.